Search code examples
sqlgoogle-bigquerywindow-functions

How to use LAG and OVER in BigQuery to calculate a percentage change in a weekly moving average?


I’m doing a project with the BigQuery USA Facts Covid-19 open dataset. The data looks like this: enter image description here

I’m trying to create a query that gives me a % change (up or down) in the 7 day covid case data by county. The end result would be the county, the date and the percentage change of the seven day moving average of covid cases. Ultimately this would allow me to show where cases are relatively stable versus where they are increasing, AKA a hotspot.

I'm new to working with LAG and OVER. So I'm pretty sure I'm just missing some basic order by or group by within my CTE.

It’s odd because when I only select out one County (where county_name=”X”), I'm able to get the 7 Day moving average just fine -- it just gives me a nice percentage for each day that tells me whether it is increasing or decreasing. The problem is when I am not selecting just one County, I'm just not able to figure out what I need to do or what I need to change in order to still get that same value. I end up getting values that make no sense. I'm pretty sure this is because I'm just using the window function wrong.

Here’s my code:


WITH 
a AS (SELECT long.*,
deaths-lag(deaths) over (order by date) as deaths_increase,
confirmed_cases - lag(confirmed_cases) over (order by date) as cases_increase,
FROM `bigquery-public-data.covid19_usafacts.summary` as long
where date >= cast('2020-05-03' as date)


)
,b as (
 SELECT
 a.*,
 AVG(a.deaths_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
 AVG(a.cases_increase) OVER(ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
FROM a
order by a.county_name

)

select 
b.county_name, 
b.county_fips_code,
b.confirmed_cases,
b.cases_increase,
b.deaths,
b.state,
b.seven_day_avg_cases,
b.date,

(b.seven_day_avg_cases - lag(b.seven_day_avg_cases) OVER( ORDER BY b.date)) / b.seven_day_avg_cases * 100 as seven_day_percent_change

from b

where seven_day_avg_cases > 0


order by date desc

Solution

  • Below is for BigQuery Standard SQL

    You should add PARTITION BY county_name to ALL OVER(...) statements in your query

    After that, your query can look like below

    #standardSQL
    WITH a AS (
      SELECT long.*,
        deaths-lag(deaths) OVER(PARTITION BY county_name ORDER BY DATE) AS deaths_increase,
        confirmed_cases - LAG(confirmed_cases) OVER (PARTITION BY county_name ORDER BY DATE) AS cases_increase,
      FROM `bigquery-public-data.covid19_usafacts.summary` AS long
      WHERE DATE >= CAST('2020-05-03' AS DATE)
    ), b AS (
      SELECT a.*,
        AVG(a.deaths_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_deaths,
        AVG(a.cases_increase) OVER(PARTITION BY county_name ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg_cases
      FROM a
    )
    SELECT 
      b.county_name, 
      b.county_fips_code,
      b.confirmed_cases,
      b.cases_increase,
      b.deaths,
      b.state,
      b.seven_day_avg_cases,
      b.date,
      (b.seven_day_avg_cases - LAG(b.seven_day_avg_cases) OVER(PARTITION BY county_name ORDER BY b.date)) / b.seven_day_avg_cases * 100 AS seven_day_percent_change
    FROM b
    WHERE seven_day_avg_cases > 0
    ORDER BY DATE DESC, county_name    
    

    Note: obviously assuming that your original query really works for one country

    Yet another weak point in your query is ORDER BY a.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW - this sets window of 7 consecutive rows (not days) which means that this workrks ONLY if you have all days in the stats - which most likely a case for this data. But still more proper use would be to use ORDER BY UNIX_DATE(a.date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW instead - this guarantees you to use windows of seven days even if some days are missing or filtered out for whatever reason, etc.