Say I have a BQ table called employee_entries
that is created and populated by the following SQL statements:
CREATE OR REPLACE TABLE `my-project.my_dataset.employee_entries` (
employee_id STRING,
entry_datetime DATETIME,
entry_flag STRING,
salary FLOAT64
);
INSERT INTO `my-project.my_dataset.employee_entries`
(employee_id, entry_datetime, entry_flag, salary) VALUES
('1234', '2023-07-15 08:00:00', '1', 50000),
('1234', '2023-07-15 09:00:00', '2', 50000),
('1234', '2023-07-15 10:00:00', '0', 50000),
('1234', '2023-07-15 11:00:00', '0', 50000),
('1234', '2023-07-15 12:00:00', '1', 50000),
('1234', '2023-07-15 13:00:00', '3', 50000),
('5678', '2023-07-15 08:30:00', '2', 75000),
('5678', '2023-07-15 09:30:00', '2', 75000),
('5678', '2023-07-15 10:30:00', '3', 75000),
('5678', '2023-07-15 11:30:00', '4', 75000),
('5678', '2023-07-15 12:30:00', '4', 75000),
('5678', '2023-07-15 13:30:00', '0', 75000);
This will create the following example table:
employee_id | entry_datetime | entry_flag | salary |
---|---|---|---|
1234 | 2023-07-15 08:00:00 | 1 | 50000 |
1234 | 2023-07-15 09:00:00 | 2 | 50000 |
1234 | 2023-07-15 10:00:00 | 0 | 50000 |
1234 | 2023-07-15 11:00:00 | 0 | 50000 |
1234 | 2023-07-15 12:00:00 | 1 | 50000 |
1234 | 2023-07-15 13:00:00 | 3 | 50000 |
5678 | 2023-07-15 08:30:00 | 2 | 75000 |
5678 | 2023-07-15 09:30:00 | 2 | 75000 |
5678 | 2023-07-15 10:30:00 | 3 | 75000 |
5678 | 2023-07-15 11:30:00 | 4 | 75000 |
5678 | 2023-07-15 12:30:00 | 4 | 75000 |
5678 | 2023-07-15 13:30:00 | 6 | 75000 |
Problem:
I want to create a new table called employees_extended
that has all the columns in this table but one additional column as well called entry_switches.
This entry_switches column is calculated from the other columns as such:
For each row in the original employee_entries
table, for a particular employee_id, if all the rows for that employee are put in chronological order according to entry_datetime,
I want to count the number of times the value of the entry_flag switched before that the current row within the past 2 hours;
a "switch" meaning that the value is different for two chronologically consecutive entries for that employee_id.
For example if, chronologically, there are 4 entries in the past 2 hours for employee_id='1234' prior to the current row, and the 4 values of entry_flag were ['1',NULL,'0','0'] then this returns an entry_switches value of 2 for that row (the "switch" from 1 to NULL and then NULL to 0. Going from 0 to 0 is not included as a switch).
Expected Result Table:
employee_id | entry_datetime | entry_flag | salary | entry_switches |
---|---|---|---|---|
1234 | 2023-07-15 08:00:00 | 1 | 50000 | 0 |
1234 | 2023-07-15 09:00:00 | 2 | 50000 | 1 |
1234 | 2023-07-15 10:00:00 | 0 | 50000 | 2 |
1234 | 2023-07-15 11:00:00 | 0 | 50000 | 1 |
1234 | 2023-07-15 12:00:00 | 1 | 50000 | 1 |
1234 | 2023-07-15 13:00:00 | 3 | 50000 | 2 |
5678 | 2023-07-15 08:30:00 | 2 | 75000 | 0 |
5678 | 2023-07-15 09:30:00 | 2 | 75000 | 0 |
5678 | 2023-07-15 10:30:00 | 3 | 75000 | 1 |
5678 | 2023-07-15 11:30:00 | 4 | 75000 | 2 |
5678 | 2023-07-15 12:30:00 | 4 | 75000 | 1 |
5678 | 2023-07-15 13:30:00 | 6 | 75000 | 1 |
How can I achieve this resulting table using a SQL query in BQ?
Edit:
My Attempt:
WITH prev_entries AS ( -- get previous entry flags for each row
SELECT *,
LAG(entry_flag) OVER (PARTITION BY employee_id ORDER BY entry_datetime) AS prev_entry_flag,
ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY entry_datetime) AS rn
FROM `my-project.my_dataset.employee_entries`
),
-- deals with situation where there is no previous entry for an employee_id
entry_changes AS (
SELECT * EXCEPT (prev_entry_flag, rn),
CASE
WHEN rn=1 THEN entry_flag
ELSE prev_entry_flag
END AS prev_entry_flag
FROM prev_entries
)
SELECT employee_id,
entry_datetime,
entry_flag,
prev_entry_flag,
-- trying find a way to change this window fucntion to only count the last 2 hours, or find an alternative solution to this problem
SUM(CASE WHEN entry_flag IS DISTINCT FROM prev_entry_flag THEN 1 ELSE 0 END) OVER (PARTITION BY employee_id ORDER BY entry_datetime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS switch_count
FROM entry_changes
ORDER BY
employee_id, entry_datetime;
If we can not use rows range by time in window function, we can (self) JOIN table with condition as
ROWS between <preceding 2 hours> and current
We use comparison as >=
and <=
to include first and current row.
Then, calculate entry_flag changes for every row for left table.
See example
with ranged_data as(
select ee1.*
,ee2.id id2,ee2.entry_datetime edt2, ee2.entry_flag flag2
,case when lag(ee2.entry_flag,1,ee2.entry_flag)
over(partition by ee1.employee_id,ee1.entry_datetime
order by ee2.entry_datetime)
=ee2.entry_flag
then 0
when coalesce(
lag(ee2.entry_flag,1,ee2.entry_flag)
over(partition by ee1.employee_id,ee1.entry_datetime
order by ee2.entry_datetime)
,ee2.entry_flag) is null
then 0
else 1
end chng_flag
from employee_entries ee1
left join employee_entries ee2 on ee1.employee_id=ee2.employee_id
and ee2.entry_datetime<=ee1.entry_datetime
-- and ee2.entry_datetime>=dateadd(hour,-2,ee1.entry_datetime)
and ee2.entry_datetime>=datetime_add(ee1.entry_datetime, interval -2 hour)
)
select id,employee_id,entry_datetime,entry_flag,salary
,sum(chng_flag) entry_switches
,string_agg(id2,'-') ids
,string_agg(case when flag2 is null then 'null' else flag2 end,'-') entry_flags
from ranged_data
group by employee_id,entry_datetime,entry_flag,salary
Used simplified check for null=null
as coalesce(null, null) is null. May be, IS DISTINCT FROM is better here.
I will add id column to table only for clarity.