I have a dataset which represents individuals' usage a form in which they have to enter values.
The form can be completed in any order beyond 'firstname' which is step 1.
The timestamp is used to imply in which order the form was completed.
Eg - user 12345678 begins filling the form at 04:58:08 on 2017-10-25 and completes the form sequentially.
User 12345679 begins filling the form at 05:00:02 on the same day but only gets as far as step 2
User 12345680 begins filling the form at 05:05:06 but is distracted and does not get beyond step 1, yet returns another two times before completing
User 12345681 begins filling the form at 06:31:12 and begins with step 1, but then fills out the form randomly.
| date | timestamp | user_id | visit_id | event | event_seq |
|--------------------------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 | 1234 | firstname | 1 |
|2017-10-25| 2017-10-25 04:58:20| 12345678 | 1234 | lastname | 2 |
|2017-10-25| 2017-10-25 04:58:35| 12345678 | 1234 | dob | 3 |
|2017-10-25| 2017-10-25 04:58:40| 12345678 | 1234 | postcode | 4 |
|2017-10-25| 2017-10-25 04:58:40| 12345678 | 1234 | email | 5 |
|2017-10-25| 2017-10-25 05:00:02| 12345679 | 1235 | firstname | 1 |
|2017-10-25| 2017-10-25 05:00:10| 12345679 | 1235 | lastname | 2 |
|2017-10-25| 2017-10-25 05:05:06| 12345680 | 1236 | firstname | 1 |
|2017-10-25| 2017-10-25 05:30:24| 12345680 | 1236 | firstname | 1 |
|2017-10-25| 2017-10-25 06:17:24| 12345680 | 1236 | firstname | 1 |
|2017-10-25| 2017-10-25 06:20:30| 12345680 | 1236 | lastname | 2 |
|2017-10-25| 2017-10-25 06:20:45| 12345680 | 1236 | dob | 3 |
|2017-10-25| 2017-10-25 06:20:45| 12345680 | 1236 | postcode | 4 |
|2017-10-25| 2017-10-25 06:20:45| 12345680 | 1236 | email | 5 |
|2017-10-25| 2017-10-25 06:31:12| 12345681 | 1237 | firstname | 1 |
|2017-10-25| 2017-10-25 06:31:18| 12345681 | 1237 | email | 5 |
|2017-10-25| 2017-10-25 06:31:50| 12345681 | 1237 | lastname | 2 |
|2017-10-25| 2017-10-25 06:32:16| 12345681 | 1237 | postcode | 4 |
|2017-10-25| 2017-10-25 06:32:40| 12345681 | 1237 | dob | 3 |
The code I have written is as follows and references a pre-existing table in which CASE WHEN is used to assign a number to each step of the form in a variable called 'events':
SELECT date
,time_stamp
,user_id
,visit_id
,collect_list(events) as event_seq
FROM my_table
GROUP BY date
,start_time
time_stamp
,user_id
,visit_id
As is somewhat expected this appears to be listing all of user 12345680's interactions into one string;
| date | timestamp | user_id | visit_id | event_seq |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 | 1234 | 1,2,3,4,5 |
|2017-10-25| 2017-10-25 05:00:02| 12345679 | 1235 | 1,2 |
|2017-10-25| 2017-10-25 05:05:06| 12345680 | 1236 |1,1,1,2,3,4,5|
|2017-10-25| 2017-10-25 06:31:12| 12345681 | 1237 | 1,5,2,4,3, |
However, what I would like to see is each row demarked by the first event in the sequence, akin to the below result set, where each of user 12345680's restarts occur on a different line.
| date | timestamp | user_id | visit_id | event_seq |
|----------------------------------------------------------------------|
|2017-10-25| 2017-10-25 04:58:08| 12345678 | 1234 | 1,2,3,4,5 |
|2017-10-25| 2017-10-25 05:00:02| 12345679 | 1235 | 1,2 |
|2017-10-25| 2017-10-25 05:05:06| 12345680 | 1236 | 1 |
|2017-10-25| 2017-10-25 05:30:24| 12345680 | 1236 | 1 |
|2017-10-25| 2017-10-25 06:17:24| 12345680 | 1236 | 1,2,3,4,5 |
|2017-10-25| 2017-10-25 06:31:12| 12345681 | 1237 | 1,5,2,4,3, |
Can anybody offer any guidance as to how I can use collect_list to achieve my desired results set?
From your SQL, it appears you have a column called start_time in the table. Assuming you have one, see the solution below
Table
CREATE EXTERNAL TABLE my_table(
event_date DATE,
event_start_timestamp TIMESTAMP,
event_timestamp TIMESTAMP,
user_id STRING,
visit_id STRING,
event STRING,
event_seq STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Data
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:08| 12345678 | 1234 | firstname | 1
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:20| 12345678 | 1234 | lastname | 2
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:35| 12345678 | 1234 | dob | 3
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 | 1234 | postcode | 4
2017-10-25| 2017-10-25 04:58:08| 2017-10-25 04:58:40| 12345678 | 1234 | email | 5
2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:02| 12345679 | 1235 | firstname | 1
2017-10-25| 2017-10-25 05:00:02| 2017-10-25 05:00:10| 12345679 | 1235 | lastname | 2
2017-10-25| 2017-10-25 05:05:06| 2017-10-25 05:05:06| 12345680 | 1236 | firstname | 1
2017-10-25| 2017-10-25 05:30:24| 2017-10-25 05:30:24| 12345680 | 1236 | firstname | 1
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:17:24| 12345680 | 1236 | firstname | 1
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:30| 12345680 | 1236 | lastname | 2
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 | 1236 | dob | 3
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 | 1236 | postcode | 4
2017-10-25| 2017-10-25 06:17:24| 2017-10-25 06:20:45| 12345680 | 1236 | email | 5
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:12| 12345681 | 1237 | firstname | 1
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:18| 12345681 | 1237 | email | 5
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:31:50| 12345681 | 1237 | lastname | 2
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:16| 12345681 | 1237 | postcode | 4
2017-10-25| 2017-10-25 06:31:12| 2017-10-25 06:32:40| 12345681 | 1237 | dob | 3
SQL Query
SELECT event_date,
user_id,
visit_id,
event_start_timestamp,
collect_list(event_seq)
FROM (SELECT event_date,
event_start_timestamp,
event_timestamp,
user_id,
visit_id,
event_seq
FROM my_table
SORT BY user_id, visit_id, event_start_timestamp, event_timestamp ASC) v
GROUP BY event_date, user_id, visit_id, event_start_timestamp ;
Output
2017-10-25 12345678 1234 2017-10-25 04:58:08 [" 1 "," 2 "," 3 "," 4 "," 5 "]
2017-10-25 12345679 1235 2017-10-25 05:00:02 [" 1 "," 2 "]
2017-10-25 12345680 1236 2017-10-25 05:05:06 [" 1 "]
2017-10-25 12345680 1236 2017-10-25 05:30:24 [" 1 "]
2017-10-25 12345680 1236 2017-10-25 06:17:24 [" 1 "," 2 "," 3 "," 4 "," 5 "]
2017-10-25 12345681 1237 2017-10-25 06:31:12 [" 1 "," 5 "," 2 "," 4 "," 3 "]
Let us know if this works!
And please don't use column names as DATE, TIMESTAMP etc, they are reserved words :)