Search code examples
google-bigquerydataform

Impute missing days with copy of last non-missing day in BigQuery


For some reason, I miss the ingestion of three days worth of data in a bigquery table. Now, I know that simply copying data from the last non-missing day is not the best way to impute missing data, but for my purposes, this is good enough.

I know that I could copy the last missing day, transform the date in pandas to DATE + 1, DATE +2 and so on and then append that data to the original table in bigquery. But, I would rather avoid having to do this. Is there a good and easy way to do this directly in bigquery or with dataform? I am not very comfortable with SQL.

Thanks for any given advice.


Solution

  • You can do the following. The query is self explanatory, but here is some details:

    • use the DATE_ADD() and DATE_SUB() to modify the data returned and to filter the day you want to copy from.

    • Use the union to return a single table many times with different modification and filters

    • Use the insert as described following to insert the retrieved data in the table.

    • Before run the insert, run only the selects and unions to check if that is the data you want

    • I've returned data from 1, 2 and 3 days ago (date_col = DATE_SUB(CURRENT_DATE(), interval 2 DAY)) and added 1 day on if date field.

    INSERT INTO `<p>.<ds>.<t>` (date_col, data)  (
        SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col =  DATE_SUB(CURRENT_DATE(), interval 1 DAY)
        UNION ALL
        SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col =  DATE_SUB(CURRENT_DATE(), interval 2 DAY)
        UNION ALL
        SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col =  DATE_SUB(CURRENT_DATE(), interval 3 DAY)
    
    )