Search code examples
sqlgoogle-bigquery

For a BQ SQL table, calculate the number of times the value in a given column switches in a given time period, partitioned by an ID value


Say I have a BQ table called employee_entries that is created and populated by the following SQL statements:

CREATE OR REPLACE TABLE `my-project.my_dataset.employee_entries`  (
  employee_id STRING,
  entry_datetime DATETIME,
  entry_flag STRING,
  salary FLOAT64
);


INSERT INTO `my-project.my_dataset.employee_entries`  
(employee_id, entry_datetime, entry_flag, salary) VALUES
('1234', '2023-07-15 08:00:00', '1', 50000),
('1234', '2023-07-15 09:00:00', '2', 50000),
('1234', '2023-07-15 10:00:00', '0', 50000),
('1234', '2023-07-15 11:00:00', '0', 50000),
('1234', '2023-07-15 12:00:00', '1', 50000),
('1234', '2023-07-15 13:00:00', '3', 50000),
('5678', '2023-07-15 08:30:00', '2', 75000),
('5678', '2023-07-15 09:30:00', '2', 75000),
('5678', '2023-07-15 10:30:00', '3', 75000),
('5678', '2023-07-15 11:30:00', '4', 75000),
('5678', '2023-07-15 12:30:00', '4', 75000),
('5678', '2023-07-15 13:30:00', '0', 75000);

This will create the following example table:

employee_id entry_datetime entry_flag salary
1234 2023-07-15 08:00:00 1 50000
1234 2023-07-15 09:00:00 2 50000
1234 2023-07-15 10:00:00 0 50000
1234 2023-07-15 11:00:00 0 50000
1234 2023-07-15 12:00:00 1 50000
1234 2023-07-15 13:00:00 3 50000
5678 2023-07-15 08:30:00 2 75000
5678 2023-07-15 09:30:00 2 75000
5678 2023-07-15 10:30:00 3 75000
5678 2023-07-15 11:30:00 4 75000
5678 2023-07-15 12:30:00 4 75000
5678 2023-07-15 13:30:00 6 75000

 

Problem:

I want to create a new table called employees_extended that has all the columns in this table but one additional column as well called entry_switches.

This entry_switches column is calculated from the other columns as such:

For each row in the original employee_entries table, for a particular employee_id, if all the rows for that employee are put in chronological order according to entry_datetime,

I want to count the number of times the value of the entry_flag switched before that the current row within the past 2 hours;

a "switch" meaning that the value is different for two chronologically consecutive entries for that employee_id.

For example if, chronologically, there are 4 entries in the past 2 hours for employee_id='1234' prior to the current row, and the 4 values of entry_flag were ['1',NULL,'0','0'] then this returns an entry_switches value of 2 for that row (the "switch" from 1 to NULL and then NULL to 0. Going from 0 to 0 is not included as a switch).

Expected Result Table:

employee_id entry_datetime entry_flag salary entry_switches
1234 2023-07-15 08:00:00 1 50000 0
1234 2023-07-15 09:00:00 2 50000 1
1234 2023-07-15 10:00:00 0 50000 2
1234 2023-07-15 11:00:00 0 50000 1
1234 2023-07-15 12:00:00 1 50000 1
1234 2023-07-15 13:00:00 3 50000 2
5678 2023-07-15 08:30:00 2 75000 0
5678 2023-07-15 09:30:00 2 75000 0
5678 2023-07-15 10:30:00 3 75000 1
5678 2023-07-15 11:30:00 4 75000 2
5678 2023-07-15 12:30:00 4 75000 1
5678 2023-07-15 13:30:00 6 75000 1

How can I achieve this resulting table using a SQL query in BQ?

Edit:

My Attempt:

WITH prev_entries AS ( -- get previous entry flags for each row
  SELECT *,
  LAG(entry_flag) OVER (PARTITION BY employee_id ORDER BY entry_datetime) AS prev_entry_flag,
  ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY entry_datetime) AS rn
  FROM `my-project.my_dataset.employee_entries`
),
-- deals with situation where there is no previous entry for an employee_id
entry_changes AS (
  SELECT * EXCEPT (prev_entry_flag, rn),
  CASE
    WHEN rn=1 THEN entry_flag
    ELSE prev_entry_flag
  END AS prev_entry_flag
  FROM prev_entries
)
SELECT employee_id,
entry_datetime,
entry_flag,
prev_entry_flag,
-- trying find a way to change this window fucntion to only count the last 2 hours, or find an alternative solution to this problem
SUM(CASE WHEN entry_flag IS DISTINCT FROM prev_entry_flag THEN 1 ELSE 0 END) OVER (PARTITION BY employee_id ORDER BY entry_datetime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS switch_count
FROM entry_changes
ORDER BY 
employee_id, entry_datetime;

Solution

  • If we can not use rows range by time in window function, we can (self) JOIN table with condition as

    ROWS between <preceding 2 hours> and current
    

    We use comparison as >= and <= to include first and current row.

    Then, calculate entry_flag changes for every row for left table.

    See example

    with ranged_data as(
    select ee1.*
      ,ee2.id id2,ee2.entry_datetime edt2, ee2.entry_flag flag2
      ,case when lag(ee2.entry_flag,1,ee2.entry_flag)
                    over(partition by ee1.employee_id,ee1.entry_datetime 
                         order by ee2.entry_datetime)
             =ee2.entry_flag 
          then 0 
        when coalesce(
          lag(ee2.entry_flag,1,ee2.entry_flag)
                    over(partition by ee1.employee_id,ee1.entry_datetime 
                         order by ee2.entry_datetime)
         ,ee2.entry_flag) is null
          then 0  
       else 1
      end chng_flag
    from employee_entries ee1
    left join employee_entries ee2 on ee1.employee_id=ee2.employee_id
      and ee2.entry_datetime<=ee1.entry_datetime
    --  and ee2.entry_datetime>=dateadd(hour,-2,ee1.entry_datetime)
      and ee2.entry_datetime>=datetime_add(ee1.entry_datetime, interval -2 hour)
    )
    select id,employee_id,entry_datetime,entry_flag,salary
      ,sum(chng_flag) entry_switches
      ,string_agg(id2,'-') ids
      ,string_agg(case when flag2 is null then 'null' else flag2 end,'-') entry_flags
    from ranged_data
    group by employee_id,entry_datetime,entry_flag,salary
    

    Used simplified check for null=null as coalesce(null, null) is null. May be, IS DISTINCT FROM is better here.
    I will add id column to table only for clarity.

    Demo for SQL Server