Search code examples
sqlhadoophivehql

Hive - pull the data for first and last records based on collection time


I have a hive table in which we are getting data per day with collection time (unix timestamp)

+--------+-----------------+------+----------+
| ticket | collection_time | type |   day    |
+--------+-----------------+------+----------+
| t1     |             123 | auto | 20170302 |
| t2     |             234 | req  | 20170302 |
| t3     |             345 | req  | 20170302 |
| t4     |             678 | auto | 20170303 |
| t5     |             111 | req  | 20170301 |
| t6     |             222 | auto | 20170301 |
| t7     |             333 | auto | 20170301 |
| t8     |             444 | req  | 20170301 |
+--------+-----------------+------+----------+

I want to find the total count of tickets on a day and (collection time and type) of ticket happening at first collection time and last collection time of a day. Please suggest any way to solve this problem using hive query.

+---------------+---------------------+----------------+--------------------+---------------+----------+
| count(ticket) | first_tkt_coll_time | first_tkt_type | last_tkt_coll_time | last_tkt_type |   day    |
+---------------+---------------------+----------------+--------------------+---------------+----------+
|             3 |                 123 | auto           |                345 | req20170302   |          |
|             1 |                 678 | auto           |                678 | auto          | 20170303 |
|             4 |                 111 | req            |                444 | req           | 20170301 |
+---------------+---------------------+----------------+--------------------+---------------+----------+

And we need to feed the count(ticket) for next day also in the same output set.

+---------+-------------+-----------------+------+-----------------+------+---‌​--+ 
| cnt_day | cnt_nxt_day | collection_time | type | collection_time | type | day | 
+---------+-------------+-----------------+------+-----------------+------+---‌​--+ 
| 4 | 3| 111 | req | 444 | req | 20170301 |
| 3 | 1| 123 | auto | 345 | req | 20170302 | 
| 1 | - | 678 | auto | 678 | auto | 20170303 | 
+---------+-------------+-----------------+------+-----------------+------+---‌​--+ 

Thank you in advance :)


Solution

  • select  cnt
           ,cnt_next_day 
           ,min_val.collection_time
           ,min_val.type
           ,max_val.collection_time
           ,max_val.type
           ,day
    
    from   (select    count(*)                                                         as cnt
                     ,lead (count(*)) over (order by day)                              as cnt_next_day
                     ,min(named_struct('collection_time',collection_time,'type',type)) as min_val
                     ,max(named_struct('collection_time',collection_time,'type',type)) as max_val
                     ,day
    
            from      mytable
    
            group by  day
            ) t
    

    +-----+--------------+-----------------+------+-----------------+------+----------+
    | cnt | cnt_next_day | collection_time | type | collection_time | type |   day    |
    +-----+--------------+-----------------+------+-----------------+------+----------+
    |   4 | 3            |             111 | req  |             444 | req  | 20170301 |
    |   3 | 1            |             123 | auto |             345 | req  | 20170302 |
    |   1 | NULL         |             678 | auto |             678 | auto | 20170303 |
    +-----+--------------+-----------------+------+-----------------+------+----------+