I have a hive table in which we are getting data per day with collection time (unix timestamp)
+--------+-----------------+------+----------+
| ticket | collection_time | type | day |
+--------+-----------------+------+----------+
| t1 | 123 | auto | 20170302 |
| t2 | 234 | req | 20170302 |
| t3 | 345 | req | 20170302 |
| t4 | 678 | auto | 20170303 |
| t5 | 111 | req | 20170301 |
| t6 | 222 | auto | 20170301 |
| t7 | 333 | auto | 20170301 |
| t8 | 444 | req | 20170301 |
+--------+-----------------+------+----------+
I want to find the total count of tickets on a day and (collection time and type) of ticket happening at first collection time and last collection time of a day. Please suggest any way to solve this problem using hive query.
+---------------+---------------------+----------------+--------------------+---------------+----------+
| count(ticket) | first_tkt_coll_time | first_tkt_type | last_tkt_coll_time | last_tkt_type | day |
+---------------+---------------------+----------------+--------------------+---------------+----------+
| 3 | 123 | auto | 345 | req20170302 | |
| 1 | 678 | auto | 678 | auto | 20170303 |
| 4 | 111 | req | 444 | req | 20170301 |
+---------------+---------------------+----------------+--------------------+---------------+----------+
And we need to feed the count(ticket) for next day also in the same output set.
+---------+-------------+-----------------+------+-----------------+------+-----+
| cnt_day | cnt_nxt_day | collection_time | type | collection_time | type | day |
+---------+-------------+-----------------+------+-----------------+------+-----+
| 4 | 3| 111 | req | 444 | req | 20170301 |
| 3 | 1| 123 | auto | 345 | req | 20170302 |
| 1 | - | 678 | auto | 678 | auto | 20170303 |
+---------+-------------+-----------------+------+-----------------+------+-----+
Thank you in advance :)
select cnt
,cnt_next_day
,min_val.collection_time
,min_val.type
,max_val.collection_time
,max_val.type
,day
from (select count(*) as cnt
,lead (count(*)) over (order by day) as cnt_next_day
,min(named_struct('collection_time',collection_time,'type',type)) as min_val
,max(named_struct('collection_time',collection_time,'type',type)) as max_val
,day
from mytable
group by day
) t
+-----+--------------+-----------------+------+-----------------+------+----------+
| cnt | cnt_next_day | collection_time | type | collection_time | type | day |
+-----+--------------+-----------------+------+-----------------+------+----------+
| 4 | 3 | 111 | req | 444 | req | 20170301 |
| 3 | 1 | 123 | auto | 345 | req | 20170302 |
| 1 | NULL | 678 | auto | 678 | auto | 20170303 |
+-----+--------------+-----------------+------+-----------------+------+----------+