Search code examples
sqldatetimecountgoogle-bigquerywindow-functions

SQL query to answer: If <event 1> occurs in timepoint A, does <event 2> occur in time period B-C?


I'm querying a large data set to figure out if a bunch of campaign events (i.e. event 1,2,..) during different timepoints gives a result in user activity (active, inactive) during the following 3 days after each event (but not in the same day as the campaign event itself).

I'm merging two tables to do this, and they look like this merged:

| date       | user | events | day_activity  | 
| 2020-01-01 | 1    | event1 | active        | 
| 2020-01-01 | 2    | event1 | inactive      | 
| 2020-01-02 | 1    | null   | inactive      |  
| 2020-01-02 | 2    | null   | active        | 
| 2020-01-03 | 1    | null   | inactive      | 
| 2020-01-03 | 2    | null   | active        | 
| 2020-01-04 | 1    | null   | active        | 
| 2020-01-04 | 2    | null   | active        | 

What I am trying to achieve is, for each user/date/event gang (= row) where an event occured, to add another column called 3_day_activity, containing the activity not on the event (= current row) day but the following 3 days only (giving a score of 1 per active day). An example for how the 1st day of this table would look after (I add * in the activity days counted in the added column for user 1, and # for the events counted in the column for user 2)):

| date       | user | events | day_activity  | 3_day_activity
| 2020-01-01 | 1    | event1 | active        | 1
| 2020-01-01 | 2    | event1 | inactive      | 3
| 2020-01-02 | 1    | null   | inactive * (0)| null (bco no event)
| 2020-01-02 | 2    | null   | active # (1)  | null (bco no event)
| 2020-01-03 | 1    | null   | inactive * (0)| null (bco no event)
| 2020-01-03 | 2    | null   | active # (1)  | null (bco no event)
| 2020-01-04 | 1    | null   | active * (1)  | null (bco no event)
| 2020-01-04 | 2    | null   | active # (1)  | null (bco no event)

I tried solving this with a window function. It runs, but I think I misunderstood some important idea on how to design it, because the result contains a ton of repetitions...

  cm.date, 
  cm.user,
  event,
  day_activity,
  COUNTIF(active_today = 'active') OVER 3d_later AS 3_day_activity
FROM `customer_message` cm
INNER JOIN `customer_day` ud
  ON cm.user = ud.user
  AND cm.date = ud.date
WHERE 
  cm.date > '2019-12-25'
WINDOW 3d_later AS (PARTITION BY user ORDER BY UNIX_DATE(cm.date) RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING)

EDIT:

I was asked to supply an example of how this repetition might look. Here's what I see if I add an "ORDER BY 3_day_activity" clause at the end of the query:

Row    date         user  day_activity  3_day_activity  
1      2020-01-01   2     active        243
2      2020-01-01   2     active        243
3      2020-01-01   2     active        243
4      2020-01-01   2     active        243
5      2020-01-01   2     active        243
6      2020-01-01   2     active        243
7      2020-01-02   2     active        243
8      2020-01-02   2     active        243

EDIT2 :

This remains unsolved.. I have tried asking another question, as per the suggestion of one commenter, but I am locked from doing so even if the problem is not identical (I suppose due to the similarities to this one). I have tested grouping based on user and date, but I then it instead throws an error due to not aggregating in the 'COUNTIF' clause.

This is the attempt mentioned; SQL: Error demanding aggregation when counting, grouping and windowing


Solution

  • Below example is for BigQuery Standard SQL

    #standardSQL
    SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
    FROM `project.dataset.table`
    WINDOW three_day_activity_window AS (
      PARTITION BY user 
      ORDER BY UNIX_DATE(date) 
      RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
    )
    

    You can test, play with above using sample data from your question as in below example

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT DATE '2020-01-01' date , 1 user, 'event1' events, 'active' day_activity UNION ALL
      SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
      SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
      SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
      SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
      SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
      SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
      SELECT '2020-01-04', 2, NULL, 'active' 
    )
    SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
    FROM `project.dataset.table`
    WINDOW three_day_activity_window AS (
      PARTITION BY user 
      ORDER BY UNIX_DATE(date) 
      RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
    )
    ORDER BY date, user   
    

    with output

    Row date        user    events  day_activity    three_day_activity   
    1   2020-01-01  1       event1  active          1    
    2   2020-01-01  2       event1  inactive        3    
    3   2020-01-02  1       null    inactive        0    
    4   2020-01-02  2       null    active          0    
    5   2020-01-03  1       null    inactive        0    
    6   2020-01-03  2       null    active          0    
    7   2020-01-04  1       null    active          0    
    8   2020-01-04  2       null    active          0       
    

    Update for - to avoid registering the same user as active multiple times in one day (and tallying those up to a huge sum)?

    If you want to avoid counting all activity for user on same day - use below adjusted version (note extra entry in sample data to introduce user's multiple activity on same day)

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT DATE '2020-01-01' DATE , 1 user, 'event1' events, 'active' day_activity UNION ALL
      SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
      SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
      SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
      SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
      SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
      SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
      SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
      SELECT '2020-01-04', 2, NULL, 'active' 
    )
    SELECT *, 
      IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
    FROM (
      SELECT date, user, MAX(events) events, MIN(day_activity) day_activity
      FROM `project.dataset.table` 
      GROUP BY date, user
    )
    WINDOW three_day_activity_window AS (
      PARTITION BY user 
      ORDER BY UNIX_DATE(date) 
      RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
    )
    ORDER BY date, user