I’m doing a project with the BigQuery USA Facts Covid-19 open dataset. The data looks like this:
I’m trying to create a query that gives me a % change (up or down) in the 7 day covid case data by county. The end result would be the county, the date and the percentage change of the seven day moving average of covid cases. Ultimately this would allow me to show where cases are relatively stable versus where they are increasing, AKA a hotspot.
I'm new to working with LAG and OVER. So I'm pretty sure I'm just missing some basic order by or group by within my CTE.
It’s odd because when I only select out one County (where county_name=”X”), I'm able to get the 7 Day moving average just fine -- it just gives me a nice percentage for each day that tells me whether it is increasing or decreasing. The problem is when I am not selecting just one County, I'm just not able to figure out what I need to do or what I need to change in order to still get that same value. I end up getting values that make no sense. I'm pretty sure this is because I'm just using the window function wrong.
Here’s my code:
WITH
a AS (SELECT long.*,
deaths-lag(deaths) over (order by date) as deaths_increase,
confirmed_cases - lag(confirmed_cases) over (order by date) as cases_increase,
FROM `bigquery-public-data.covid19_usafacts.summary` as long
where date >= cast('2020-05-03' as date)
)
,b as (
SELECT
a.*,
AVG(a.deaths_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
AVG(a.cases_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
FROM a
order by a.county_name
)
select
b.county_name,
b.county_fips_code,
b.confirmed_cases,
b.cases_increase,
b.deaths,
b.state,
b.seven_day_avg_cases,
b.date,
(b.seven_day_avg_cases - lag(b.seven_day_avg_cases) OVER( ORDER BY b.date)) / b.seven_day_avg_cases * 100 as seven_day_percent_change
from b
where seven_day_avg_cases > 0
order by date desc
Below is for BigQuery Standard SQL
You should add PARTITION BY county_name
to ALL OVER(...) statements in your query
After that, your query can look like below
#standardSQL
WITH a AS (
SELECT long.*,
deaths-lag(deaths) OVER(PARTITION BY county_name ORDER BY DATE) AS deaths_increase,
confirmed_cases - LAG(confirmed_cases) OVER (PARTITION BY county_name ORDER BY DATE) AS cases_increase,
FROM `bigquery-public-data.covid19_usafacts.summary` AS long
WHERE DATE >= CAST('2020-05-03' AS DATE)
), b AS (
SELECT a.*,
AVG(a.deaths_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
AVG(a.cases_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
FROM a
)
SELECT
b.county_name,
b.county_fips_code,
b.confirmed_cases,
b.cases_increase,
b.deaths,
b.state,
b.seven_day_avg_cases,
b.date,
(b.seven_day_avg_cases - LAG(b.seven_day_avg_cases) OVER(PARTITION BY county_name ORDER BY b.date)) / b.seven_day_avg_cases * 100 AS seven_day_percent_change
FROM b
WHERE seven_day_avg_cases > 0
ORDER BY DATE DESC, county_name
Note: obviously assuming that your original query really works for one country
Yet another weak point in your query is ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
- this sets window of 7 consecutive rows (not days) which means that this workrks ONLY if you have all days in the stats - which most likely a case for this data. But still more proper use would be to use ORDER BY UNIX_DATE(a.date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW
instead - this guarantees you to use windows of seven days even if some days are missing or filtered out for whatever reason, etc.