I have a hive table of two columns. The first columns is time, and second is scattered objects. I wish to get all groups of same objects which are continuous in time, and pick up the first and last record. How to achieve this in hive?
For example, I have a table like this:
id time object
1 10:01:00 a
2 10:02:00 a
3 10:03:00 a
4 10:04:00 b
5 10:05:00 b
6 10:06:00 a
7 10:07:00 a
8 10:08:00 a
9 10:09:00 a
10 10:10:00 a
11 10:11:00 c
I wish to get this (as object 'a' is continuous from 10:01:00 to 10:03:00 and from 10:06:00 to 10:10:00, so both line1&line3 and line6&line10 are picked up):
id time object
1 10:01:00 a
3 10:03:00 a
4 10:04:00 b
5 10:05:00 b
6 10:06:00 a
10 10:10:00 a
11 10:11:00 c
What should I do to achieve this?
This is island and gap problem and you can use the row_number analytical function as follows:
select * from
(select t.*,
row_number() over (partition by rn-rn_o order by time) as rn,
row_number() over (partition by rn-rn_o order by time desc) as rn_d
from
(select t.*,
row_number() over (order by time) as rn,
row_number() over (partition by object order by time) as rn_o
from your_table t) t)
where 1 in (rn, rn_d);