Search code examples
sqlsqlitesql-function

How to access previous rows in custom SQL scalar function?


I want to write my own SQL scalar function that calculates the mean between data points of a single column over a variable number of rows, based on a second column. I envision this function looking something like WINDOWMEAN(data, mileage, 100). This would compute the mean of all data over the rows where the mileage is within 100 miles of that of the current row.

Example:

| data | mileage | 
|  10  |  1000   |
|  15  |  1009   |
|  20  |  1056   |
|  16  |  1098   |
|  13  |  1130   |
|  14  |  1200   |

With a query like SELECT WINDOWMEAN(data, mileage, 100) AS a FROM t, I would expect:

| data | mileage | a    |
|  10  |  1000   | 10   |
|  15  |  1009   | 12.5 |
|  20  |  1056   | 15   |
|  16  |  1098   | 15.25|
|  13  |  1130   | 16.33|
|  14  |  1200   | 13.5 |

a is calculated for each row as the mean of all rows within 100 miles of mileage that come before the current row.

Where I am getting stuck is how to access previous rows in a custom SQL function. I am not certain what I am trying to accomplish is even possible because I've yet to find documentation for accessing other rows in this way.

Whether specific to this solution or not, is there any way to access previous rows in a custom SQL scalar function?

(I can make the assumption that the rows are ordered by mileage)


Solution

  • You can use a window function with the appropriate RANGE frame definition to limit the window to rows within 100 miles of the current one:

    SELECT data, mileage
         , avg(data) OVER (ORDER BY mileage RANGE BETWEEN 100 PRECEDING AND CURRENT ROW) AS a
    FROM t
    ORDER BY mileage;
    
    data        mileage     a
    ----------  ----------  ----------
    10          1000        10.0
    15          1009        12.5
    20          1056        15.0
    16          1098        15.25
    13          1130        16.3333333
    14          1200        13.5
    

    Notes:

    This query requires Sqlite 3.28 or newer, as that version had some major enhancements to window functions to allow, among other things, numeric ranges like this.

    For best results, create an index on t(mileage) or a covering index on t(mileage, data).


    Non-window function version using a correlated subquery (Also works better with that index):

    SELECT data, mileage
         , (SELECT avg(t2.data) FROM t AS t2
            WHERE t2.mileage BETWEEN t1.mileage - 100 AND t1.mileage) AS a
    FROM t AS t1
    ORDER BY mileage;