
How to split dates into groups of 24 hours - BigQuery


This is my goal: split the dates into groups of 24 hours, where the groups depend on the data itself, not on a fixed calendar date.

Let's say these are the calls from a call center. I want to know how many sessions I had, where a session is valid for 24 hours, counted from the first date_sent of that session. If the next call arrives after those first 24 hours have elapsed, a new session is created.

The expected results look like this. The first three columns are the ones I already have in the table; the remaining columns (StartOfSession, sessionId) are the required calculation:

identifier  customer_id date_sent               StartOfSession          sessionId
456456150   5366    2020-09-01T10:17:48.360000  2020-09-01T10:17:48.360000  1
456456150   5366    2020-09-01T18:24:45.552000  2020-09-01T10:17:48.360000  1
456456150   5366    2020-09-02T10:20:46.283000  2020-09-02T10:20:46.283000  2
456456150   5366    2020-09-02T18:25:01.911000  2020-09-02T10:20:46.283000  2
456456150   5366    2020-09-03T10:20:38.407000  2020-09-02T10:20:46.283000  2
456456150   5366    2020-09-03T18:23:35.915000  2020-09-03T18:23:35.915000  3
456456150   5366    2020-09-04T10:19:46.474000  2020-09-03T18:23:35.915000  3
456456150   5366    2020-09-04T14:22:17.236000  2020-09-03T18:23:35.915000  3
456456150   5366    2020-09-04T18:24:33.155000  2020-09-04T18:24:33.155000  4
456456150   5366    2020-09-05T10:19:48.871000  2020-09-04T18:24:33.155000  4
456456150   5366    2020-09-05T18:25:07.968000  2020-09-05T18:25:07.968000  5
456456150   5366    2020-09-06T10:19:34.808000  2020-09-05T18:25:07.968000  5
456456150   5366    2020-09-06T18:26:17.705000  2020-09-06T18:26:17.705000  6
456456150   5366    2020-09-07T10:21:28.585000  2020-09-06T18:26:17.705000  6
456456150   5366    2020-09-07T18:24:17.123000  2020-09-06T18:26:17.705000  6
456456150   5366    2020-09-08T10:20:09.850000  2020-09-08T10:20:09.850000  7
456456150   5366    2020-09-08T18:24:32.733000  2020-09-08T10:20:09.850000  7
456456150   5366    2020-09-09T10:20:05.336000  2020-09-08T10:20:09.850000  7
456456150   5366    2020-09-09T12:12:41.137000  2020-09-09T12:12:41.137000  8
456456150   5366    2020-09-09T18:24:25.783000  2020-09-09T12:12:41.137000  8
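To make the rule concrete, here is a minimal standalone sketch (plain JavaScript, hypothetical names, not the actual table logic) that assigns session ids the way described above, checked against the first few expected rows:

```javascript
// Session rule: a session starts at the first call; any call arriving more
// than 24 hours after the session start opens a new session anchored at itself.
function assignSessions(timestamps) {
  const DAY = 24 * 3600 * 1000;
  let session = 1;
  let sessionStart = timestamps[0];
  return timestamps.map(t => {
    if (t - sessionStart > DAY) {
      session++;
      sessionStart = t; // new session is anchored at this call
    }
    return { date_sent: t, sessionStart, session };
  });
}

const calls = [
  '2020-09-01T10:17:48.360Z',
  '2020-09-01T18:24:45.552Z',
  '2020-09-02T10:20:46.283Z', // > 24h after 09-01 10:17 -> session 2
  '2020-09-03T10:20:38.407Z', // still within 24h of 09-02 10:20 -> session 2
  '2020-09-03T18:23:35.915Z', // > 24h after 09-02 10:20 -> session 3
].map(s => new Date(s));

console.log(assignSessions(calls).map(r => r.session)); // [ 1, 1, 2, 2, 3 ]
```

Note that the 09-03 10:20 call stays in session 2 because it is 23h59m after the session start, while the 09-03 18:23 call exceeds the 24-hour mark and opens session 3, matching the expected table.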

I've tried using window functions, but I cannot achieve the expected results:

  SELECT identifier, customer_id, date_sent,
    FIRST_VALUE(date_sent) OVER (
      PARTITION BY A.identifier, A.customer_id, CAST(A.date_sent AS DATE)
      ORDER BY UNIX_SECONDS(TIMESTAMP(date_sent))
      RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW
    ) FirstV_date1
  FROM `sandbox.testing` A
  WHERE identifier = '456456150'
    AND date_sent BETWEEN '2020-09-01' AND '2020-09-10'

These are my actual results:

identifier  customer_id date_sent                   FirstV_date1
456456150   5366    2020-09-01T10:17:48.360000  2020-09-01T10:17:48.360000
456456150   5366    2020-09-01T18:24:45.552000  2020-09-01T10:17:48.360000
456456150   5366    2020-09-02T10:20:46.283000  2020-09-02T10:20:46.283000
456456150   5366    2020-09-02T18:25:01.911000  2020-09-02T10:20:46.283000
456456150   5366    2020-09-03T10:20:38.407000  2020-09-03T10:20:38.407000
456456150   5366    2020-09-03T18:23:35.915000  2020-09-03T10:20:38.407000
456456150   5366    2020-09-04T10:19:46.474000  2020-09-04T10:19:46.474000
456456150   5366    2020-09-04T14:22:17.236000  2020-09-04T10:19:46.474000
456456150   5366    2020-09-04T18:24:33.155000  2020-09-04T10:19:46.474000
456456150   5366    2020-09-05T10:19:48.871000  2020-09-05T10:19:48.871000
456456150   5366    2020-09-05T18:25:07.968000  2020-09-05T10:19:48.871000
456456150   5366    2020-09-06T10:19:34.808000  2020-09-06T10:19:34.808000
456456150   5366    2020-09-06T18:26:17.705000  2020-09-06T10:19:34.808000
456456150   5366    2020-09-07T10:21:28.585000  2020-09-07T10:21:28.585000
456456150   5366    2020-09-07T18:24:17.123000  2020-09-07T10:21:28.585000
456456150   5366    2020-09-08T10:20:09.850000  2020-09-08T10:20:09.850000
456456150   5366    2020-09-08T18:24:32.733000  2020-09-08T10:20:09.850000
456456150   5366    2020-09-09T10:20:05.336000  2020-09-09T10:20:05.336000
456456150   5366    2020-09-09T12:12:41.137000  2020-09-09T10:20:05.336000
456456150   5366    2020-09-09T18:24:25.783000  2020-09-09T10:20:05.336000
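For reference, a minimal sketch (plain JavaScript, hypothetical names) of what a 24-hour trailing window computes, and why it diverges from the session rule even without the per-day partition: the window yields the earliest call within the last 24 hours, while the correct StartOfSession depends on where the previous session began - state that no fixed window frame carries forward.

```javascript
const DAY = 24 * 3600 * 1000;

// What a 24h trailing RANGE window computes for a given call:
// the earliest call that falls within the preceding 24 hours.
function rollingFirst(calls, t) {
  return calls.find(u => u <= t && t - u <= DAY);
}

const calls = [
  '2020-09-01T10:17:48.360Z',
  '2020-09-01T18:24:45.552Z',
  '2020-09-02T10:20:46.283Z',
].map(s => new Date(s));

// For the 09-02 call, the rolling window still sees the 09-01 18:24 call and
// reports it as the first value - but under the session rule this call is
// already more than 24h past the session start (09-01 10:17), so it should
// open session 2 with itself as StartOfSession.
console.log(rollingFirst(calls, calls[2]).toISOString());
// 2020-09-01T18:24:45.552Z
```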

I've also tried a self join, but I'd rather avoid that because it's very expensive. Any ideas would be welcome.

Thanks in advance!


Solution

  • Below is for BigQuery Standard SQL (it assumes that the date_sent column is of TIMESTAMP data type, as it appears to be in the provided example):

    #standardSQL
    create temp function get_sessions(x array<timestamp>)
    returns array<struct<date_sent timestamp, sessionStart timestamp, session string>>
    language js as """
      var output = [], session = 1, sessionStart = x[0], total_dur = 0;
      output.push({date_sent: x[0], sessionStart: sessionStart, session: '' + session});
      for (var i = 1; i < x.length; i++) {
        // time elapsed since the start of the current session
        total_dur += x[i].getTime() - x[i-1].getTime();
        if (total_dur > 24*3600*1000) {
          // more than 24 hours since the session started - open a new session
          total_dur = 0; session++; sessionStart = x[i];
        }
        output.push({date_sent: x[i], sessionStart: sessionStart, session: '' + session});
      }
      return output;
    """;
    select identifier, customer_id, date_sent, sessionStart, session
    from (
      select identifier, customer_id, get_sessions(array_agg(date_sent order by date_sent)) sessions
      from `project.dataset.table`
      group by identifier, customer_id
    ), unnest(sessions)
    

    When applied to the sample data in the question, the output matches the expected results above.

    Important: the assumption here - based on your comments - is that the volume of rows per partition (identifier, customer_id) is relatively small (~2K), so the JS UDF memory limit is not a problem.
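    A quick way to sanity-check the UDF body outside BigQuery is to run the same loop in Node.js against the question's 20 timestamps (hypothetical test scaffold, same logic as the JS inside the UDF):

```javascript
// Node.js replica of the UDF loop: accumulated gap since the last session
// reset equals the time since the session start, so comparing it to 24h
// decides when a new session opens.
function getSessions(x) {
  let session = 1, sessionStart = x[0], totalDur = 0;
  const out = [{ date_sent: x[0], sessionStart, session }];
  for (let i = 1; i < x.length; i++) {
    totalDur += x[i].getTime() - x[i - 1].getTime();
    if (totalDur > 24 * 3600 * 1000) {
      totalDur = 0; session++; sessionStart = x[i];
    }
    out.push({ date_sent: x[i], sessionStart, session });
  }
  return out;
}

const calls = [
  '2020-09-01T10:17:48.360Z', '2020-09-01T18:24:45.552Z',
  '2020-09-02T10:20:46.283Z', '2020-09-02T18:25:01.911Z',
  '2020-09-03T10:20:38.407Z', '2020-09-03T18:23:35.915Z',
  '2020-09-04T10:19:46.474Z', '2020-09-04T14:22:17.236Z',
  '2020-09-04T18:24:33.155Z', '2020-09-05T10:19:48.871Z',
  '2020-09-05T18:25:07.968Z', '2020-09-06T10:19:34.808Z',
  '2020-09-06T18:26:17.705Z', '2020-09-07T10:21:28.585Z',
  '2020-09-07T18:24:17.123Z', '2020-09-08T10:20:09.850Z',
  '2020-09-08T18:24:32.733Z', '2020-09-09T10:20:05.336Z',
  '2020-09-09T12:12:41.137Z', '2020-09-09T18:24:25.783Z',
].map(s => new Date(s));

console.log(getSessions(calls).map(r => r.session).join(','));
// 1,1,2,2,2,3,3,3,4,4,5,5,6,6,6,7,7,7,8,8
```

    This reproduces sessionIds 1 through 8 exactly as in the expected results table in the question.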