Search code examples
sql-servert-sql

Efficient way to forward-fill nulls in time-series data using T-SQL


I have a table with time-series data that's mostly nulls, and I want to fill in all of the nulls with the last known value.

I have a few solutions, but they're much slower than doing the equivalent DataFrame.fillna(method='ffill') operation in Pandas.

A simplified version of the code / data that I'm using:

select d.[date], d.[price],
       (select top 1 p.price from price_table p
        where p.price is not null and p.[date] <= p.[date]
        order by p.[date] desc) as ff_price
from price_table d

To produce the table

date       price ff_price
---------- ----- --------
2016-07-11 0.79  0.79
2016-07-12 NULL  0.79
2016-07-13 NULL  0.79
2016-07-14 0.69  0.69
2016-07-15 NULL  0.69
...
2016-09-21 0.88  0.88
...

I have >100 million rows, so this takes quite a while.


Solution

  • Assuming that your column is DATE and price is DECIMAL(5,2), please test this approach:

    SELECT
        P.[date],
        P.[price],
        ff_price = CONVERT(
            DECIMAL(5,2),       -- Original price datatype
            SUBSTRING(
                MAX(
                    CAST(P.[date] AS BINARY(3)) +   -- 3: datalength of P.[date] column
                    CAST(P.[price] AS BINARY(5))    -- 5: datalength of P.[price] column
                ) OVER (ORDER BY P.[date] ROWS UNBOUNDED PRECEDING),
    
                4,  -- Position to start that's not the binary part of the date
    
                5))-- Characters that compose the binary of the original price datatype
    FROM
        price_table  AS P
    

    This is a solution I implemented with a similar problem and you can find the exaustive explanation here. The reason this approach is good is because it doesn't require a explicit sort, as long as you have an index by date.

    What it does is basically use a windowed MAX with the concatenation of the 3 bytes that composes your date column (this is why I mentioned that you column must be DATE, otherwise DATETIME will need 8 bytes, you can edit the query to work with this) with the bytes that compose your price column (which are 5 bytes, also assumed). This is the CAST(P.[date] AS BINARY(3)) + CAST(P.[price] AS BINARY(5)) part.

    When you calculate this and ORDER BY P.[date] ROWS UNBOUNDED PRECEDING, the engine is basically doing rolling max with values which most significant bytes are your dates. The max result will always update when the date changes, but considering that concatenating any value with NULL as price will also yield NULL (as binary), then the MAX will always ignore this value and retain the previous non-null MAX (by P.[date] ROWS UNBOUNDED PRECEDING).

    This is the binary result of the windowed MAX (I added a previous record with NULL so you see that result is NULL for null prices values):

    date        price   ff_price    WindowedMax
    2016-07-10  NULL    NULL        NULL
    2016-07-11  0.79    0.79        0x9B3B0B050200014F
    2016-07-12  NULL    0.79        0x9B3B0B050200014F
    2016-07-13  NULL    0.79        0x9B3B0B050200014F
    2016-07-14  0.69    0.69        0x9E3B0B0502000145
    2016-07-15  NULL    0.69        0x9E3B0B0502000145
    2016-07-21  0.88    0.88        0xA53B0B0502000158
    2016-07-22  NULL    0.88        0xA53B0B0502000158