大数据学习24:hive 您所在的位置:网站首页 hive累计求和函数 大数据学习24:hive

大数据学习24:hive

2024-06-02 04:58| 来源: 网络整理| 查看: 265

环境: hive

需求: 输入数据,每天用户的流量 分隔符 \t

user date access panda 2017-1-01 5t gifshow 2017-1-01 3t yy 2017-1-01 2t laifeng 2017-1-01 2t panda 2017-1-02 5t gifshow 2017-1-02 3t yy 2017-1-02 2t laifeng 2017-1-02 2t panda 2017-2-01 4t gifshow 2017-2-01 3t yy 2017-2-01 1t laifeng 2017-2-01 4t panda 2017-2-02 4t gifshow 2017-2-02 3t yy 2017-2-02 1t laifeng 2017-2-02 4t panda 2017-3-01 4t gifshow 2017-3-01 3t yy 2017-3-01 1t laifeng 2017-3-01 4t panda 2017-3-02 4t gifshow 2017-3-02 3t yy 2017-3-02 1t laifeng 2017-3-02 4t =============>求出下表,每个用户按月统计当月数值,并新增按月累加字段 user date acc acc_sum gifshow 2017-1 6 6 gifshow 2017-2 6 12 gifshow 2017-3 6 18 laifeng 2017-1 4 4 laifeng 2017-2 8 12 laifeng 2017-3 8 20 panda 2017-1 10 10 panda 2017-2 8 18 panda 2017-3 8 26 yy 2017-1 4 4 yy 2017-2 2 6 yy 2017-3 2 8

分析: 由于原输入数据为文本,所以从本地导入hive 建 shipin_origin 的时候,hive表中字段均为String。 那么结果需要按月统计,则必须对 date 进行 substr() 。 在本案例中,有两种方式实现这个功能。 1)通过 hive 自带的函数 sum()over() 进行求解,效果好,时间短 2)通过 标准 sql 进行 inner join 进行求解,效果差,但是更能理解inner join 和 实现原理

操作: 方法一: 1、先在hive上建表,并load数据

create table shipin_origin ( user String, date String, access String ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;

2、这个查询是用来将原始表进行按月累计,形成一个每个月总量的中间表

select user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2 from shipin_origin a group by user,substr(date,1,6) 结果: gifshow 2017-1 6 gifshow 2017-2 6 gifshow 2017-3 6 laifeng 2017-1 4 laifeng 2017-2 8 laifeng 2017-3 8 panda 2017-1 10 panda 2017-2 8 panda 2017-3 8 yy 2017-1 4 yy 2017-2 2 yy 2017-3 2

3、通过这个中间表,采用sum()over() 函数

select A.user ,A.c1 ,A.c2 ,sum(A.c2)over(partition by A.user order by A.c1 ) acc_sum from ( select user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2 from shipin_origin a group by user,substr(date,1,6) ) A; 结果: gifshow 2017-1 6 6 gifshow 2017-2 6 12 gifshow 2017-3 6 18 laifeng 2017-1 4 4 laifeng 2017-2 8 12 laifeng 2017-3 8 20 panda 2017-1 10 10 panda 2017-2 8 18 panda 2017-3 8 26 yy 2017-1 4 4 yy 2017-2 2 6 yy 2017-3 2 8 Time taken: 23.988 seconds, Fetched: 12 row(s)

注意:在上面这个sql中,因为使用了嵌套查询,一定要把嵌套内的查询结果进行别名处理,否则会出现以下的报错:

FAILED: ParseException line 3:53 cannot recognize input near '' '' '' in subquery source

方法二: 1、先在hive上建表,并load数据

create table shipin_origin ( user String, date String, access String ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;

2、同样将原始表进行按月累计,形成一个每个月总量的中间表,但是这里我们创建一个中间表去存储 (为什么这里进行了中间表存储,是为了简便sql写法,不建表其实也可以,但是逻辑上面要注意!!)

hive (default)> create table mid_shipin ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' as select user,substr(date,1,6) c1 ,sum (cast(substr(access,1,1) as INT )) c2 from shipin_origin a group by user,substr(date,1,6) ; 结果: gifshow 2017-1 6 gifshow 2017-2 6 gifshow 2017-3 6 laifeng 2017-1 4 laifeng 2017-2 8 laifeng 2017-3 8 panda 2017-1 10 panda 2017-2 8 panda 2017-3 8 yy 2017-1 4 yy 2017-2 2 yy 2017-3 2 hive (default)> desc mid_shipin; OK user string c1 string c2 bigint Time taken: 0.326 seconds, Fetched: 3 row(s)

3、开始对这个表进行操作,并获得结果:

select A.user,A.c1,max(A.c2) c2 ,sum(B.c2) accumulate from (select user,c1,sum(c2) c2 from mid_shipin group by user,c1) A inner join (select user,c1,sum(c2) c2 from mid_shipin group by user,c1 ) B on A.user = B.user where B.c1结果: gifshow 2017-1 6 6 gifshow 2017-2 6 12 gifshow 2017-3 6 18 laifeng 2017-1 4 4 laifeng 2017-2 8 12 laifeng 2017-3 8 20 panda 2017-1 10 10 panda 2017-2 8 18 panda 2017-3 8 26 yy 2017-1 4 4 yy 2017-2 2 6 yy 2017-3 2 8 Time taken: 114.184 seconds, Fetched: 12 row(s)

这个过程中,用到了4对 mapreduce 和 1次 map 时间消耗很大!

过程分析:

#为了测试,放到mysql 里面看sql逻辑 create table sp1 ( user varchar(10), date varchar(10), acc int ); insert into sp1 values ('gifshow', '2017-1', 6 ); insert into sp1 values ('gifshow', '2017-2' , 6 ); insert into sp1 values ('gifshow', '2017-3' , 6 ); insert into sp1 values ('laifeng', '2017-1' , 4 ); insert into sp1 values ('laifeng', '2017-2' , 8 ); insert into sp1 values ('laifeng', '2017-3' , 8 ); insert into sp1 values ('panda' , '2017-1' , 10 ); insert into sp1 values ('panda' , '2017-2' , 8 ); insert into sp1 values ('panda' , '2017-3' , 8 ); insert into sp1 values ('yy' , '2017-1' , 4 ); insert into sp1 values ('yy' , '2017-2' , 2 ); insert into sp1 values ('yy' , '2017-3' , 2 ); select * from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user ;

注意后面带 * 的行

+---------+--------+------+---------+--------+------+ | user | date | acc | user | date | acc | +---------+--------+------+---------+--------+------+ | gifshow | 2017-1 | 6 | gifshow | 2017-1 | 6 | | gifshow | 2017-1 | 6 | gifshow | 2017-2 | 6 |* | gifshow | 2017-1 | 6 | gifshow | 2017-3 | 6 |* | gifshow | 2017-2 | 6 | gifshow | 2017-1 | 6 | | gifshow | 2017-2 | 6 | gifshow | 2017-2 | 6 | | gifshow | 2017-2 | 6 | gifshow | 2017-3 | 6 |* | gifshow | 2017-3 | 6 | gifshow | 2017-1 | 6 | | gifshow | 2017-3 | 6 | gifshow | 2017-2 | 6 | | gifshow | 2017-3 | 6 | gifshow | 2017-3 | 6 | | laifeng | 2017-1 | 4 | laifeng | 2017-1 | 4 | | laifeng | 2017-1 | 4 | laifeng | 2017-2 | 8 |* | laifeng | 2017-1 | 4 | laifeng | 2017-3 | 8 |* | laifeng | 2017-2 | 8 | laifeng | 2017-1 | 4 | | laifeng | 2017-2 | 8 | laifeng | 2017-2 | 8 | | laifeng | 2017-2 | 8 | laifeng | 2017-3 | 8 |* | laifeng | 2017-3 | 8 | laifeng | 2017-1 | 4 | | laifeng | 2017-3 | 8 | laifeng | 2017-2 | 8 | | laifeng | 2017-3 | 8 | laifeng | 2017-3 | 8 | | panda | 2017-1 | 10 | panda | 2017-1 | 10 | | panda | 2017-1 | 10 | panda | 2017-2 | 8 |* | panda | 2017-1 | 10 | panda | 2017-3 | 8 |* | panda | 2017-2 | 8 | panda | 2017-1 | 10 | | panda | 2017-2 | 8 | panda | 2017-2 | 8 | | panda | 2017-2 | 8 | panda | 2017-3 | 8 |* | panda | 2017-3 | 8 | panda | 2017-1 | 10 | | panda | 2017-3 | 8 | panda | 2017-2 | 8 | | panda | 2017-3 | 8 | panda | 2017-3 | 8 | | yy | 2017-1 | 4 | yy | 2017-3 | 2 |* | yy | 2017-1 | 4 | yy | 2017-1 | 4 | | yy | 2017-1 | 4 | yy | 2017-2 | 2 |* | yy | 2017-2 | 2 | yy | 2017-3 | 2 |* | yy | 2017-2 | 2 | yy | 2017-1 | 4 | | yy | 2017-2 | 2 | yy | 2017-2 | 2 | | yy | 2017-3 | 2 | yy | 2017-3 | 2 | | yy | 2017-3 | 2 | yy | 2017-1 | 4 | | yy | 2017-3 | 2 | yy | 2017-2 | 2 | +---------+--------+------+---------+--------+------+ 36 rows in set (0.00 sec) mysql> select * from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date select A.user,A.date,max(a.acc) acc_month, sum(B.acc) acc_sum from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date


【本文地址】

公司简介

联系我们

今日新闻

    推荐新闻

    专题文章
      CopyRight 2018-2019 实验室设备网 版权所有