Search code examples
sqlhivehiveql

How to select the first and the last record base on time-sequence in hive?


I have a hive table of two columns. The first columns is time, and second is scattered objects. I wish to get all groups of same objects which are continuous in time, and pick up the first and last record. How to achieve this in hive?

For example, I have a table like this:

id   time      object
1   10:01:00   a
2   10:02:00   a
3   10:03:00   a
4   10:04:00   b
5   10:05:00   b
6   10:06:00   a
7   10:07:00   a
8   10:08:00   a
9   10:09:00   a
10  10:10:00   a
11  10:11:00   c

I wish to get this (as object 'a' is continuous from 10:01:00 to 10:03:00 and from 10:06:00 to 10:10:00, so both line1&line3 and line6&line10 are picked up):

id   time      object
1   10:01:00   a
3   10:03:00   a
4   10:04:00   b
5   10:05:00   b
6   10:06:00   a
10  10:10:00   a
11  10:11:00   c

What should I do to achieve this?


Solution

  • This is island and gap problem and you can use the row_number analytical function as follows:

    select * from
    (select t.*,
           row_number() over (partition by rn-rn_o order by time) as rn,
           row_number() over (partition by rn-rn_o order by time desc) as rn_d
      from
    (select t.*,
           row_number() over (order by time) as rn,
           row_number() over (partition by object order by time) as rn_o
      from your_table t) t)
    where 1 in (rn, rn_d);