Search code examples
sqlperformancehivetime-series

SQL Return the record where the value in a column have changed


I have data in Hive table which look something like this -

VIN Mode event Start End
ABC123456789 Mode 1 Deauthorized 01/01/2010 00:00:00 05/05/2014 14:54:54
ABC123456789 Mode 1 Deauthorized 05/05/2014 14:54:54 05/13/2014 19:09:51
ABC123456789 Mode 1 Deauthorized 05/13/2014 19:09:51 11/13/2014 22:26:32
ABC123456789 Mode 1 Authorized 11/13/2014 22:26:32 11/13/2014 22:31:00
ABC123456789 Mode 1 Authorized 11/13/2014 22:31:00 11/14/2014 01:23:56
ABC123456789 Mode 2 Deauthorized 11/14/2014 01:23:56 11/18/2014 19:38:51
ABC123456789 Mode 2 Deauthorized 11/18/2014 19:38:51 11/18/2014 19:38:54
ABC123456789 Mode 2 Deauthorized 11/18/2014 19:38:54 11/18/2014 20:07:52
ABC123456789 Mode 2 Authorized 11/18/2014 20:07:52 12/17/2014 19:22:50
ABC123456789 Mode 2 Authorized 12/17/2014 19:22:50 02/25/2015 20:03:44
ABC123456789 Mode 2 Authorized 02/25/2015 20:03:44 02/25/2015 20:03:48
ABC123456789 Mode 3 Authorized 02/25/2015 20:03:48 02/25/2015 20:14:05
ABC123456789 Mode 3 Deauthorized 02/25/2015 20:14:05 02/25/2015 20:14:29
ABC123456789 Mode 3 Deauthorized 02/25/2015 20:14:29 02/25/2015 20:40:21

I would like to get a summarized data where the value in event column has changed from previous value. The data points are in ascending order of the Start timestamp. I tried window function but it didn't work for me. The result would look similar to what I have shown in below table. Can you suggest any optimized solution for this?

VIN Mode event Start End
ABC123456789 Mode 1 Deauthorized 01/01/2010 00:00:00 05/05/2014 14:54:54
ABC123456789 Mode 1 Authorized 11/13/2014 22:26:32 11/13/2014 22:31:00
ABC123456789 Mode 2 Deauthorized 11/14/2014 01:23:56 11/18/2014 19:38:51
ABC123456789 Mode 2 Authorized 11/18/2014 20:07:52 12/17/2014 19:22:50
ABC123456789 Mode 3 Deauthorized 02/25/2015 20:14:05 02/25/2015 20:14:29

Solution

  • You can use lag():

    select t.*
    from (select t.*, 
                 lag(event) over (partition by vin order by start) as prev_event
          from t
         ) t
    where prev_event is null or prev_event <> event;
    

    This looks at the changes by time and vin. I'm not sure if the mode is relevant too. If so, add it to the partition by.