Search code examples
dategoogle-bigquerygrouping

BigQuery SQL: Group rows with shared ID that occur within 7 days of each other, and return values from most recent occurrence


I have a table of datestamped events that I need to bundle into 7-day groups, starting with the earliest occurrence of each event_id.

The final output should return each bundle's start and end date and 'value' column of the most recent event from each bundle.

There is no predetermined start date, and the '7-day' windows are arbitrary, not 'week of the year'.

I've tried a ton of examples from other posts but none quite fit my needs or use things I'm not sure how to refactor for BigQuery

Sample Data;

Event_Id Event_Date Value
1 2022-01-01 010203
1 2022-01-02 040506
1 2022-01-03 070809
1 2022-01-20 101112
1 2022-01-23 131415
2 2022-01-02 161718
2 2022-01-08 192021
3 2022-02-12 212223

Expected output;

Event_Id Start_Date End_Date Value
1 2022-01-01 2022-01-03 070809
1 2022-01-20 2022-01-23 131415
2 2022-01-02 2022-01-08 192021
3 2022-02-12 2022-02-12 212223

Solution

  • You might consider below.

    CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
    LANGUAGE js AS """
      bin = 0;
      a.reduce((c, v) => {
        if (c + Number(v) > 6) { bin += 1; return 0; }
        else return c += Number(v); 
      }, 0);
    
      return bin;
    """;
    
    WITH sample_data AS (
        select 1 event_id, DATE '2022-01-01' event_date, '010203' value union all
        select 1 event_id, '2022-01-02' event_date, '040506' value union all
        select 1 event_id, '2022-01-03' event_date, '070809' value union all
        select 1 event_id, '2022-01-20' event_date, '101112' value union all
        select 1 event_id, '2022-01-23' event_date, '131415' value union all
        select 2 event_id, '2022-01-02' event_date, '161718' value union all
        select 2 event_id, '2022-01-08' event_date, '192021' value union all
        select 3 event_id, '2022-02-12' event_date, '212223' value
    ),
    binning AS (
      SELECT *, cumsumbin(ARRAY_AGG(diff) OVER w1) bin
        FROM (
          SELECT *, DATE_DIFF(event_date, LAG(event_date) OVER w0, DAY) AS diff
            FROM sample_data
          WINDOW w0 AS (PARTITION BY event_id ORDER BY event_date)
        ) WINDOW w1 AS (PARTITION BY event_id ORDER BY event_date)
    )
    SELECT event_id, 
           MIN(event_date) start_date,
           ARRAY_AGG(
             STRUCT(event_date AS end_date, value) ORDER BY event_date DESC LIMIT 1
           )[OFFSET(0)].*
      FROM binning GROUP BY event_id, bin;
    

    enter image description here