Search code examples
sqlgoogle-bigqueryinterpolationlinear-interpolation

How to fill irregularly missing values with linear interepolation in BigQuery?


I have data which has missing values irregulaly, and I'd like to convert it with a certain interval with liner interpolation using BigQuery Standard SQL.

Specifically, I have data like this:

# data is missing irregulary
+------+-------+
| time | value |
+------+-------+
|    1 | 3.0   |
|    5 | 5.0   |
|    7 | 1.0   |
|    9 | 8.0   |
|   10 | 4.0   |
+------+-------+

and I'd like to convert this table as follows:

# interpolated with interval of 1
+------+--------------------+
| time | value_interpolated |
+------+--------------------+
|    1 | 3.0                |
|    2 | 3.5                |
|    3 | 4.0                |
|    4 | 4.5                |
|    5 | 5.0                |
|    6 | 3.0                |
|    7 | 1.0                |
|    8 | 4.5                |
|    9 | 8.0                |
|   10 | 4.0                |
+------+--------------------+

Any smart soluton for this?

Supplement: this question is similar to this question in stackoverflow but different in that the data is missing irregulaly.

Thank you.


Solution

  • Below is for BigQuery Standard SQL

    #standardSQL
    select time,
      ifnull(value, start_value + (end_value - start_value) / (end_tick - start_tick) * (time - start_tick)) as value_interpolated
    from (
      select time, value,
        first_value(tick ignore nulls) over win1 as start_tick,
        first_value(value ignore nulls) over win1 as start_value,
        first_value(tick ignore nulls) over win2 as end_tick,
        first_value(value ignore nulls) over win2 as end_value,
      from (
        select time, t.time as tick, value
        from (
          select generate_array(min(time), max(time)) times
          from `project.dataset.table`
        ), unnest(times) time 
        left join `project.dataset.table` t
        using(time)
      )
      window win1 as (order by time desc rows between current row and unbounded following),
      win2 as (order by time rows between current row and unbounded following)
    )
    

    if to apply to sample data from your question - output is

    enter image description here