Search code examples
sqlfunctionhadoophivehiveql

Using collect_list function in SQL (Hive) to aggregate user sequence


I have a dataset which represents individuals' usage a form in which they have to enter values.

The form can be completed in any order beyond 'firstname' which is step 1.

The timestamp is used to imply in which order the form was completed.

Eg - user 12345678 begins filling the form at 04:58:08 on 2017-10-25 and completes the form sequentially.

User 12345679 begins filling the form at 05:00:02 on the same day but only gets as far as step 2

User 12345680 begins filling the form at 05:05:06 but is distracted and does not get beyond step 1, yet returns another two times before completing

User 12345681 begins filling the form at 06:31:12 and begins with step 1, but then fills out the form randomly.

|    date  |     timestamp      | user_id  |  visit_id   |      event    |  event_seq  |
|--------------------------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 |    1234     |   firstname   |       1     |
|2017-10-25| 2017-10-25 04:58:20| 12345678 |    1234     |   lastname    |       2     |
|2017-10-25| 2017-10-25 04:58:35| 12345678 |    1234     |      dob      |       3     |
|2017-10-25| 2017-10-25 04:58:40| 12345678 |    1234     |   postcode    |       4     |
|2017-10-25| 2017-10-25 04:58:40| 12345678 |    1234     |     email     |       5     |
|2017-10-25| 2017-10-25 05:00:02| 12345679 |    1235     |   firstname   |       1     |
|2017-10-25| 2017-10-25 05:00:10| 12345679 |    1235     |   lastname    |       2     |
|2017-10-25| 2017-10-25 05:05:06| 12345680 |    1236     |   firstname   |       1     |
|2017-10-25| 2017-10-25 05:30:24| 12345680 |    1236     |   firstname   |       1     |
|2017-10-25| 2017-10-25 06:17:24| 12345680 |    1236     |   firstname   |       1     |
|2017-10-25| 2017-10-25 06:20:30| 12345680 |    1236     |   lastname    |       2     |
|2017-10-25| 2017-10-25 06:20:45| 12345680 |    1236     |      dob      |       3     |
|2017-10-25| 2017-10-25 06:20:45| 12345680 |    1236     |   postcode    |       4     |
|2017-10-25| 2017-10-25 06:20:45| 12345680 |    1236     |     email     |       5     |
|2017-10-25| 2017-10-25 06:31:12| 12345681 |    1237     |   firstname   |       1     |
|2017-10-25| 2017-10-25 06:31:18| 12345681 |    1237     |     email     |       5     |
|2017-10-25| 2017-10-25 06:31:50| 12345681 |    1237     |   lastname    |       2     |
|2017-10-25| 2017-10-25 06:32:16| 12345681 |    1237     |   postcode    |       4     |
|2017-10-25| 2017-10-25 06:32:40| 12345681 |    1237     |      dob      |       3     |

The code I have written is as follows and references a pre-existing table in which CASE WHEN is used to assign a number to each step of the form in a variable called 'events':

SELECT date
,time_stamp
,user_id
,visit_id
,collect_list(events) as event_seq
FROM my_table
GROUP BY date
,start_time
time_stamp
,user_id
,visit_id 

As is somewhat expected this appears to be listing all of user 12345680's interactions into one string;

|    date  |     timestamp      | user_id  |  visit_id   |  event_seq  |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 |    1234     |  1,2,3,4,5  |
|2017-10-25| 2017-10-25 05:00:02| 12345679 |    1235     |     1,2     |
|2017-10-25| 2017-10-25 05:05:06| 12345680 |    1236     |1,1,1,2,3,4,5| 
|2017-10-25| 2017-10-25 06:31:12| 12345681 |    1237     |  1,5,2,4,3, | 

However, what I would like to see is each row demarked by the first event in the sequence, akin to the below result set, where each of user 12345680's restarts occur on a different line.

|    date  |     timestamp      | user_id  |  visit_id   |  event_seq  |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 |    1234     |  1,2,3,4,5  |
|2017-10-25| 2017-10-25 05:00:02| 12345679 |    1235     |     1,2     |
|2017-10-25| 2017-10-25 05:05:06| 12345680 |    1236     |      1      |
|2017-10-25| 2017-10-25 05:30:24| 12345680 |    1236     |      1      |
|2017-10-25| 2017-10-25 06:17:24| 12345680 |    1236     |  1,2,3,4,5  |      
|2017-10-25| 2017-10-25 06:31:12| 12345681 |    1237     |  1,5,2,4,3, | 

Can anybody offer any guidance as to how I can use collect_list to achieve my desired results set?


Solution

  • From your SQL, it appears you have a column called start_time in the table. Assuming you have one, see the solution below

    Table

    CREATE EXTERNAL TABLE my_table(
      event_date DATE,
      event_start_timestamp TIMESTAMP,
      event_timestamp TIMESTAMP,
      user_id STRING,
      visit_id STRING,
      event STRING,
      event_seq STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    STORED AS TEXTFILE;
    

    Data

    2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:08| 12345678 |    1234     |   firstname   |       1     
    2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:20| 12345678 |    1234     |   lastname    |       2     
    2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:35| 12345678 |    1234     |      dob      |       3     
    2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 |    1234     |   postcode    |       4     
    2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 |    1234     |     email     |       5     
    2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:02| 12345679 |    1235     |   firstname   |       1     
    2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:10| 12345679 |    1235     |   lastname    |       2     
    2017-10-25| 2017-10-25 05:05:06| 2017-10-25 05:05:06| 12345680 |    1236     |   firstname   |       1     
    2017-10-25| 2017-10-25 05:30:24| 2017-10-25 05:30:24| 12345680 |    1236     |   firstname   |       1     
    2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:17:24| 12345680 |    1236     |   firstname   |       1     
    2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:30| 12345680 |    1236     |   lastname    |       2     
    2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 |    1236     |      dob      |       3     
    2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 |    1236     |   postcode    |       4     
    2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 |    1236     |     email     |       5     
    2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:12| 12345681 |    1237     |   firstname   |       1     
    2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:18| 12345681 |    1237     |     email     |       5     
    2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:50| 12345681 |    1237     |   lastname    |       2     
    2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:16| 12345681 |    1237     |   postcode    |       4     
    2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:40| 12345681 |    1237     |      dob      |       3   
    

    SQL Query

    SELECT event_date,
           user_id, 
           visit_id, 
           event_start_timestamp, 
           collect_list(event_seq)
      FROM (SELECT event_date, 
                   event_start_timestamp, 
                   event_timestamp, 
                   user_id, 
                   visit_id, 
                   event_seq 
              FROM my_table 
              SORT BY user_id, visit_id, event_start_timestamp, event_timestamp ASC) v
     GROUP BY event_date, user_id, visit_id, event_start_timestamp ;
    

    Output

    2017-10-25   12345678       1234        2017-10-25 04:58:08 ["       1     ","       2     ","       3     ","       4     ","       5     "]
    2017-10-25   12345679       1235        2017-10-25 05:00:02 ["       1     ","       2     "]
    2017-10-25   12345680       1236        2017-10-25 05:05:06 ["       1     "]
    2017-10-25   12345680       1236        2017-10-25 05:30:24 ["       1     "]
    2017-10-25   12345680       1236        2017-10-25 06:17:24 ["       1     ","       2     ","       3     ","       4     ","       5     "]
    2017-10-25   12345681       1237        2017-10-25 06:31:12 ["       1     ","       5     ","       2     ","       4     ","       3     "]
    

    Let us know if this works!

    And please don't use column names as DATE, TIMESTAMP etc, they are reserved words :)