I'm querying a large data set to figure out if a bunch of campaign events (i.e. event 1,2,..) during different timepoints gives a result in user activity (active, inactive) during the following 3 days after each event (but not in the same day as the campaign event itself).
I'm merging two tables to do this, and they look like this merged:
| date | user | events | day_activity |
| 2020-01-01 | 1 | event1 | active |
| 2020-01-01 | 2 | event1 | inactive |
| 2020-01-02 | 1 | null | inactive |
| 2020-01-02 | 2 | null | active |
| 2020-01-03 | 1 | null | inactive |
| 2020-01-03 | 2 | null | active |
| 2020-01-04 | 1 | null | active |
| 2020-01-04 | 2 | null | active |
What I am trying to achieve is, for each user/date/event gang (= row) where an event occured, to add another column called 3_day_activity
, containing the activity not on the event (= current row) day but the following 3 days only (giving a score of 1 per active day). An example for how the 1st day of this table would look after (I add * in the activity days counted in the added column for user 1, and # for the events counted in the column for user 2)):
| date | user | events | day_activity | 3_day_activity
| 2020-01-01 | 1 | event1 | active | 1
| 2020-01-01 | 2 | event1 | inactive | 3
| 2020-01-02 | 1 | null | inactive * (0)| null (bco no event)
| 2020-01-02 | 2 | null | active # (1) | null (bco no event)
| 2020-01-03 | 1 | null | inactive * (0)| null (bco no event)
| 2020-01-03 | 2 | null | active # (1) | null (bco no event)
| 2020-01-04 | 1 | null | active * (1) | null (bco no event)
| 2020-01-04 | 2 | null | active # (1) | null (bco no event)
I tried solving this with a window function. It runs, but I think I misunderstood some important idea on how to design it, because the result contains a ton of repetitions...
cm.date,
cm.user,
event,
day_activity,
COUNTIF(active_today = 'active') OVER 3d_later AS 3_day_activity
FROM `customer_message` cm
INNER JOIN `customer_day` ud
ON cm.user = ud.user
AND cm.date = ud.date
WHERE
cm.date > '2019-12-25'
WINDOW 3d_later AS (PARTITION BY user ORDER BY UNIX_DATE(cm.date) RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING)
EDIT:
I was asked to supply an example of how this repetition might look. Here's what I see if I add an "ORDER BY 3_day_activity" clause at the end of the query:
Row date user day_activity 3_day_activity
1 2020-01-01 2 active 243
2 2020-01-01 2 active 243
3 2020-01-01 2 active 243
4 2020-01-01 2 active 243
5 2020-01-01 2 active 243
6 2020-01-01 2 active 243
7 2020-01-02 2 active 243
8 2020-01-02 2 active 243
EDIT2 :
This remains unsolved.. I have tried asking another question, as per the suggestion of one commenter, but I am locked from doing so even if the problem is not identical (I suppose due to the similarities to this one). I have tested grouping based on user and date, but I then it instead throws an error due to not aggregating in the 'COUNTIF' clause.
This is the attempt mentioned; SQL: Error demanding aggregation when counting, grouping and windowing
Below example is for BigQuery Standard SQL
#standardSQL
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
You can test, play with above using sample data from your question as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2020-01-01' date , 1 user, 'event1' events, 'active' day_activity UNION ALL
SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 2, NULL, 'active'
)
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user
with output
Row date user events day_activity three_day_activity
1 2020-01-01 1 event1 active 1
2 2020-01-01 2 event1 inactive 3
3 2020-01-02 1 null inactive 0
4 2020-01-02 2 null active 0
5 2020-01-03 1 null inactive 0
6 2020-01-03 2 null active 0
7 2020-01-04 1 null active 0
8 2020-01-04 2 null active 0
Update for - to avoid registering the same user as active multiple times in one day (and tallying those up to a huge sum)?
If you want to avoid counting all activity for user on same day - use below adjusted version (note extra entry in sample data to introduce user's multiple activity on same day)
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2020-01-01' DATE , 1 user, 'event1' events, 'active' day_activity UNION ALL
SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
SELECT '2020-01-04', 2, NULL, 'active'
)
SELECT *,
IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM (
SELECT date, user, MAX(events) events, MIN(day_activity) day_activity
FROM `project.dataset.table`
GROUP BY date, user
)
WINDOW three_day_activity_window AS (
PARTITION BY user
ORDER BY UNIX_DATE(date)
RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user