博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
使用zeppelin分析电子商务消费行为
阅读量:4160 次
发布时间:2019-05-26

本文共 14383 字,大约阅读时间需要 47 分钟。

文章目录

一.任务描述

需求概述

  • 对某零售企业最近1年门店收集的数据进行数据分析
  • 潜在客户画像
  • 用户消费统计
  • 门店的资源利用率
  • 消费的特征人群定位
  • 数据的可视化展现

二.问题分析

问题分析1:Customer表

customer_details details
customer_id Int, 1 - 500
first_name string
last_name string
email string, such as willddy@gmail.com
gender string, Male or female
address string
country string
language string
job string, job title/position
credit_type string, credit card type, such as visa
credit_no string, credit card number

问题:language字段数据存在错误

问题分析2:Transaction表

transaction_details details
transaction_id Int, 1 - 1000
customer_id Int, 1 - 500
store_id Int, 1 - 5
price decimal, such as 5.08
product string, things bought
date string, when to purchase
time string, what time to purchase

问题:表中transaction_id有重复,但数据有效,需要修复数据

问题分析3:Store表

transaction_details details
transaction_id Int, 1 - 1000
customer_id Int, 1 - 500
store_id Int, 1 - 5
price decimal, such as 5.08
product string, things bought
date string, when to purchase
time string, what time to purchase

问题分析1:Review表

store_review details
stransaction_id Int, 1 - 8000
store_id Int, 1 - 5
review_store Int, 1 - 5

问题:表中有无效的score数据表中有将transaction_id映射到错误的store_id

三.连接zeppelin

导入电子商务消费行为分析数据及模板

在这里插入图片描述
在这里插入图片描述

使用刚才创建的模板

1.从windows上传到linux 的/tmp/data目录下

2.Understand the Data

%sh## /tmp/data/-- 查看行数cd /tmp/data/wc -l customer_details.csvwc -l store_details.csvwc -l transaction_details.csvwc -l store_review.csv-- 查看头两行head -2 customer_details.csvhead -2 transaction_details.csvhead -2 store_details.csvhead -2 store_review.csv

3. Upload the file to HDFS

%shcd /tmp/data/hdfs dfs -rm -r -f /tmp/shoppinghdfs dfs -mkdir -p /tmp/shopping/data/customerhdfs dfs -mkdir -p /tmp/shopping/data/storehdfs dfs -mkdir -p /tmp/shopping/data/reviewhdfs dfs -mkdir -p /tmp/shopping/data/transactionhdfs dfs -chmod -R 777 /tmp-- 上传数据到hdfshdfs dfs -put customer_details.csv /tmp/shopping/data/customer/hdfs dfs -put transaction_details.csv /tmp/shopping/data/transaction/hdfs dfs -put store_details.csv /tmp/shopping/data/store/hdfs dfs -put store_review.csv /tmp/shopping/data/review/

4.建表查表

4.1 Clear all tables if exists

create database if not exists shoppinguse shopping
-- 创建顾客表create external table if not exists ext_customer_details (customer_id string, --we can use int as wellfirst_name string,last_name string,email string,gender string,address string,country string,language string,job string,credit_type string,credit_no string)row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'location '/tmp/shopping/data/customer' --this must tblproperties tblproperties ("skip.header.line.count"="1")
-- 创建交易流水表create external table if not exists ext_transaction_details (transaction_id string,customer_id string,store_id string,price decimal(8,2),product string,purchase_date string,purchase_time string)row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'location '/tmp/shopping/data/transaction' --this must tblproperties tblproperties ("skip.header.line.count"="1")
-- 创建商店详情表create external table if not exists ext_store_details (store_id string,store_name string,employee_number int)row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'location '/tmp/shopping/data/store' --this must tblproperties tblproperties ("skip.header.line.count"="1")
-- 创建评价表create external table if not exists ext_store_review (transaction_id string,store_id string,review_score int)row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'location '/tmp/shopping/data/review' --this must tblproperties tblproperties ("skip.header.line.count"="1")

4.2 Verify all Tables are Created

%hive--select * from ext_customer_details limit 20--select * from ext_transaction_details limit 20--select * from ext_store_details limit 20select * from ext_store_review limit 20

5.数据清洗

解决以下有问题的数据

  • 对transaction_details中的重复数据生成新ID
  • 过滤掉store_review中没有评分的数据
  • 可以把清洗好的数据放到另一个表或者用View表示
  • 找出PII (personal information identification) 或PCI (personal confidential information) 数据进行加密或hash
  • 重新组织transaction数据按照日期YYYY-MM做分区

5.1 Clean and Mask customer_details

%hive-- 敏感信息加密-- drop view vm_customer_detailscreate view if not exists vm_customer_details asselectcustomer_id ,first_name ,unbase64(last_name) lastname,unbase64(email) email,gender ,unbase64(address) address,country ,language,job ,credit_type ,unbase64(credit_no) credit_nofrom ext_customer_details

5.2 Clean transaction_details into partition table

%hive-- 创建流水详情表create table if not exists transaction_details(transaction_id string,customer_id string,store_id string,price decimal(8,2),product string,purchase_date date,purchase_time string)partitioned by(purchase_month string)-- select transaction_id,count(1) from ext_transaction_details group by transaction_id having count(1)>1-- select * from ext_transaction_details where transaction_id=8001set hive.exec.dynamic.partition.mode=nonstrict -- 开启动态分区-- 重写数据with base as (selecttransaction_id,customer_id ,store_id ,price ,product,purchase_date,purchase_time,from_unixtime(unix_timestamp(purchase_date,'yyyy-MM-dd'),'yyyy-MM') as purchase_month,row_number() over (partition by transaction_id order by store_id) as rnfrom ext_transaction_details)insert overwrite table transaction_details partition(purchase_month)selectif(rn=1,transaction_id,concat_ws('-',transaction_id,'_fix')) ,customer_id ,store_id ,price ,product,purchase_date ,purchase_time,purchase_monthfrom base-- 查看修复信息select * from transaction_details where transaction_id like '%fix%'

5.3 Clean store_review table

create view if not exists vw_store_review asselecttransaction_id,review_scorefrom  ext_store_review where review_score <> ''show tables

最终会出现如下7个表:

在这里插入图片描述

6.Customer分析

6.1找出顾客最常用的信用卡

%hiveselect credit_type,count(distinct credit_no) as credit_cntfromvm_customer_detailsgroup by credit_typeorder by credit_cnt desc

在这里插入图片描述

6.2找出客户资料中排名前五的职位名称

%hiveselect job ,count(1) as pnfrom vm_customer_detailsgroup by joborder by pn desclimit 5

在这里插入图片描述

6.3在美国女性最常用的信用卡

%hiveselect  credit_type,count(1) as ctfrom vm_customer_detailswhere country='United States' and gender =='Female'group by credit_typeorder by ct desc limit 5

在这里插入图片描述

6.4按性别和国家进行客户统计

%hiveselect country,gender, count(1) cntfrom vm_customer_detailsgroup by country,gender

在这里插入图片描述

7.Transaction分析

7.1计算每月总收入

%hiveselect sum(price) as revenue_mon,purchase_monthfrom transaction_detailsgroup by purchase_month

在这里插入图片描述

7.2计算每个季度的总收入

%hivewithbash as(select price, ( concat(year(purchase_date),'-',ceil(month(purchase_date)/3)))as year_quarterfrom transaction_details)select sum(price) revenue_quarterfrom bash

在这里插入图片描述

7.3按年计算总收入

select year(purchase_date),sum(price)from transaction_detailsgroup by year(purchase_date)

在这里插入图片描述

7.4按工作日计算总收入

%hiveselect dayofweek(cast(purchase_date as string))-1 work_date,sum(price)from transaction_detailswhere dayofweek(cast(purchase_date as string)) between 2 and 6group by dayofweek(cast(purchase_date as string))

在这里插入图片描述

7.5/7.6按时间段计算总收入(需要清理数据)

-- 使用正则表达式清理数据然后使用case when 分组查询witht1 as(select *, if(instr(purchase_time,'PM')>0,				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,					0,					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTransfrom transaction_details), t2 as(select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'				 else 'night'			end as timeSplitfrom t1)select t2.timeSplit,sum(price)from t2 group by t2.timeSplit

在这里插入图片描述

7.7按工作日计算平均消费

%hiveselect dayofweek(cast(purchase_date as string))-1 work_date,avg(price)from transaction_detailswhere dayofweek(cast(purchase_date as string)) between 2 and 6group by dayofweek(cast(purchase_date as string))

在这里插入图片描述

7.8计算年、月、日的交易总数

-- 按天计数select purchase_date ,count(1)from transaction_detailsgroup by purchase_date -- 按年计数select year(purchase_date),count(1)from transaction_detailsgroup by year(purchase_date)-- 按月计数select concat(year(purchase_date),'-',month(purchase_date)),count(1)from transaction_detailsgroup by year(purchase_date),month(purchase_date)-- 合计select purchase_date,	count(1) over(partition by year(purchase_date)),	count(1) over(partition by year(purchase_date),month(purchase_date)),	count(1) over(partition by year(purchase_date),month(purchase_date),day(purchase_date))from transaction_details

在这里插入图片描述

7.9找出交易量最大的10个客户

select customer_id,count(1) cfrom transaction_detailsgroup by customer_idorder by c desclimit 10

在这里插入图片描述

7.10找出消费最多的前10位顾客

select customer_id ,sum(price) sfrom transaction_detailsgroup by customer_idorder by s desclimit 10

在这里插入图片描述

7.11统计该期间交易数量最少的用户

select customer_id ,count(1) cfrom transaction_detailsgroup by customer_idorder by c asclimit 1

在这里插入图片描述

7.12计算每个季度的独立客户总数

select concat(year(purchase_date),'年',ceil(month(purchase_date)/3),'季度'),count(distinct customer_id)from transaction_detailsgroup by year(purchase_date),ceil(month(purchase_date)/3)

在这里插入图片描述

7.13计算每周的独立客户总数

select concat(year(purchase_date),'年第',weekofyear(purchase_date),'周'),count(distinct customer_id)from transaction_detailsgroup by year(purchase_date),weekofyear(purchase_date)

在这里插入图片描述

7.14计算整个活动客户平均花费的最大值

select a.customer_id,max(a.av)from(select customer_id,avg(price) avfrom transaction_detailsgroup by customer_id) agroup by a.customer_id;

在这里插入图片描述

7.15统计每月花费最多的客户

select b.m,b.id,b.sfrom(select a.m,a.id,a.s ,row_number() over(partition by  a.m order by a.s desc) as win1from(select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id,sum(price) sfrom transaction_detailsgroup by year(purchase_date),month(purchase_date),customer_id)a) b where b.win1=1

在这里插入图片描述

7.16统计每月访问次数最多的客户

select b.m,b.id,b.cfrom(select a.m,a.id,a.c,row_number() over(partition by a.m order by a.c desc) as win1 from(select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id, count(1) cfrom transaction_detailsgroup by year(purchase_date),month(purchase_date),customer_id) a) b where b.win1=1

在这里插入图片描述

7.17按总价找出最受欢迎的5种产品

select product,sum(price) s from transaction_detailsgroup by productorder by s desclimit 5

在这里插入图片描述

7.18根据购买频率找出最畅销的5种产品

select product,count(1) c from transaction_detailsgroup by productorder by c desclimit 5

在这里插入图片描述

7.19根据客户数量找出最受欢迎的5种产品

select product,count(distinct customer_id) c from transaction_detailsgroup by productorder by climit 5

8.Store分析

8.1按客流量找出最受欢迎的商店

select store_id,count(1) c from transaction_detailsgroup by store_idorder by c desclimit 1

在这里插入图片描述

8.2根据顾客消费价格找出最受欢迎的商店

select store_id,sum(price) s from transaction_detailsgroup by store_id order by s desc  limit 1

在这里插入图片描述

8.3根据顾客交易情况找出最受欢迎的商店

select store_id,count(1) c ,sum(price) s from transaction_detailsgroup by store_id order by c desc ,s desc limit 1

8.4根据商店和唯一的顾客id获取最受欢迎的产品

select b.store_id,b.productfrom (select a.store_id,a.product,a.c ,row_number() over(partition by store_id order by a.c desc )as win1 from(select store_id,product,count(distinct customer_id) c from transaction_detailsgroup by store_id,product) a )b where b.win1 =1

在这里插入图片描述

8.5获取每个商店的员工与顾客比

select a.store_id,concat_ws(':',cast(ceil(round(s.employee_number/a.c*100))as string),'100')from(select t.store_id,count(distinct customer_id) cfrom transaction_details t group by t.store_id)a join ext_store_details s on a.store_id=s.store_id

在这里插入图片描述

8.6按年和月计算每家店的收入

-- 按月select  store_id,year(purchase_date),month(purchase_date),sum(price)from transaction_detailsgroup by store_id,year(purchase_date),month(purchase_date)-- 按年 select  store_id,year(purchase_date),sum(price)from transaction_detailsgroup by store_id,year(purchase_date)-- 合计到一张表select distinct *from(select store_id,year(purchase_date),sum(price) over(partition by year(purchase_date)),month(purchase_date),sum(price) over(partition by year(purchase_date),month(purchase_date))from transaction_details)a

在这里插入图片描述

8.7按店铺制作总收益饼图

select store_id,sum(price)from transaction_detailsgroup by store_id

在这里插入图片描述

8.8找出每个商店最繁忙的时间段

witht1 as(select *, if(instr(purchase_time,'PM')>0,				if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24,					0,					cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12),				cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTransfrom transaction_details), t2 as(select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning'				 when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning'				 when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon'				 when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon'				 when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening'				 else 'night'			end as timeSplitfrom t1),t3 as(select t2.store_id,t2.timeSplit,count(1) c from t2 group by t2.store_id,t2.timeSplit),t4 as(select t3.store_id,t3.timeSplit,row_number() over(partition by store_id order by t3.timeSplit desc)as win1from t3 )select t4.store_id,t4.timeSplitfrom t4where t4.win1=1

在这里插入图片描述

8.9找出每家店的忠实顾客

-- 购买超过6次select a.*from(select store_id,customer_id,count(1) cfrom transaction_detailsgroup by store_id,customer_id)a where a.c>6

在这里插入图片描述

8.10根据每位员工的最高收入找出明星商店

-- 求总收入与雇员比值的最大值witht1 as (select  store_id,sum(price) s from transaction_details group by store_id)select t1.store_id,t1.s/s.employee_number ssfrom t1 join ext_store_details s  on s.store_id= t1.store_idorder by ss desc limit 1

在这里插入图片描述

9.Review分析

9.1在ext_store_review中找出存在冲突的交易映射关系

select transaction_idfrom vw_store_reviewgroup by transaction_idhaving count(1)>1

在这里插入图片描述

9.2了解客户评价的覆盖率

-- 求各个店共有多少顾客评价with t1 as(select t2.store_id,t1.transaction_id,t2.customer_idfrom vw_store_review t1 join transaction_details t2 on t1.transaction_id=t2.transaction_id)select t1.store_id,count(distinct t1.customer_id)from t1 group by t1.store_id

在这里插入图片描述

9.3根据评分了解客户的分布情况

-- 求每家店每个评分有多少个客户给的witht1 as(select  t2.store_id ,t1.review_score,t2.customer_idfrom vw_store_review t1 join  transaction_details t2 on t1.transaction_id=t2.transaction_id)select t1.store_id,t1.review_score,count(distinct customer_id)from t1group by t1.store_id,t1.review_score

在这里插入图片描述

9.4根据交易了解客户的分布情况

-- 求每家店每个客户的订单数select store_id,customer_id,count(1)from transaction_detailsgroup by store_id,customer_id

在这里插入图片描述

9.5客户给出的最佳评价是否总是同一家门店

-- 每位顾客对每家店的评分只取最大值,然后筛选每家店评分为5的数量,最大就是最优店witht1 as(select r.store_id,t.customer_id,max(r.review_score) m from ext_store_review r  join transaction_details t on r.transaction_id = t.transaction_id group by r.store_id,t.customer_id),t2 as (select * from t1 where t1.m=5)select store_id,count(t2.m) c  from t2 group by store_idorder by c desc limit 1

在这里插入图片描述

转载地址:http://ocjxi.baihongyu.com/

你可能感兴趣的文章
从手Q与微信之争,看腾讯内在的真实矛盾与战略
查看>>
移动互联网的七宗败案
查看>>
互联网十大失败案
查看>>
小米颓势已现,生死劫命悬手机
查看>>
三大隐忧 三星未来路在何方?
查看>>
linux下各种进制转化最简单的的命令行
查看>>
结构体和联合体
查看>>
ACM(Association for Computing Machinery )组织的详细介绍
查看>>
unix高级编程之-命令行参数(实践一)
查看>>
无线网络加密方式对比 .
查看>>
linux中cat命令使用详解
查看>>
Static 作用详述
查看>>
透析ICMP协议(三): 牛刀初试之一 应用篇ping(ICMP.dll)
查看>>
透析ICMP协议(四): 牛刀初试之二 应用篇ping(RAW Socket)
查看>>
再次写给我们这些浮躁的程序员
查看>>
Linux下重要日志文件及查看方式(1)
查看>>
Linux下重要日志文件及查看方式(2)
查看>>
Ubuntu系统root用户密码找回方法
查看>>
Linux驱动程序中比较重要的宏
查看>>
芯片驱动问题定位思路总结之一单板重启的问题
查看>>