Search code examples
sqllinear-regressionmodeling

Find rolling line of best fit in SQL


I'm trying to find the rolling line of best fit for a set of data, when we look at groups of five points at a time, ordered by the x value. In other words:

  • For rows 1-4 there is no value, because we don't have 5 total values yet
  • For row 5, get the slope and yIntercept for rows 1-5
  • For row 6, get the slope and yIntercept for rows 2-6
  • For row 7, get the slope and yIntercept for rows 3-7
  • For row 8, get the slope and yIntercept for rows 4-8
  • For row 9, get the slope and yIntercept for rows 5-9

Here's the values I'm aiming for, in an Excel sheet and plot. The values for slope and yIntercept are correct according to pen-and-paper and online linear regression calculation:

Excel screenshot

...and here's the SQL I have so far:

WITH dataset AS (
    SELECT 1 AS x, 9 AS y UNION ALL
    SELECT 2 AS x, 7 AS y UNION ALL
    SELECT 3 AS x, 5 AS y UNION ALL
    SELECT 4 AS x, 3 AS y UNION ALL
    SELECT 5 AS x, 1 AS y UNION ALL
    SELECT 6 AS x, 1 AS y UNION ALL
    SELECT 7 AS x, 1 AS y UNION ALL
    SELECT 8 AS x, 1 AS y UNION ALL
    SELECT 9 AS x, 1 AS y
),
rollingAverages AS (
    SELECT
        dataset.*,
        AVG(dataset.x * 1.00) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS [xMean],
        AVG(dataset.y * 1.00) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS [yMean],
        SUM(1) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS [yCount]
    FROM dataset
),
mValue AS (
    SELECT
        *,
        CASE WHEN yCount < 5 THEN NULL ELSE x - yCount + 1 END AS xStart,
        CASE WHEN yCount < 5 THEN NULL ELSE x END AS xEnd,
        CASE
            WHEN yCount < 5 THEN NULL
            WHEN SUM((x - xMean) * (x - xMean)) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) = 0
            THEN 0
            ELSE
            SUM((x - xMean) * (y - yMean)) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
            / SUM((x - xMean) * (x - xMean)) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
        END AS slope
    FROM rollingAverages
),
-- This is the y intercept at the start of the range, i.e. 40 trading days before "today"
yIntercept AS (
    SELECT
        *,
        yMean - slope * xMean AS yIntercept
    FROM mValue
),
channelNowMidpoint AS (
    SELECT
        *
    FROM yIntercept
)

SELECT *
FROM channelNowMidpoint
ORDER BY x

I'm not getting the correct values for slope or yIntercept, I think because the line-of-best-fit algorithm I'm using expects an unbounded set of values, so the calculations I get for xMean and yMean have lost context by the time I get to the CTE named mValue. For reference, you can find a line-of-best-fit algorithm that uses the "least squares" method here.

See below for the values I'm getting when I run this SQL in SSMS:

SSMS output

As you can see, where x = 5, the slope or yIntercept are correct, but after that it is incorrect. I'm not sure where I'm going wrong and how to get the values I'm aiming for.


Solution

  • Ok, I've figured it out. The problem with using window functions in this case is we're getting the values of xMean and yMean at the row being processed, not at the lead row of the window.

    In order to fix this, the mValue CTE needs to join back to the dataset CTE to get the values for x and y then stop using window functions, so that the xMean and yMean values are static for each row:

    mValue AS (
        SELECT
            ra.*,
            CASE WHEN ra.yCount < 5 THEN NULL ELSE ra.x - ra.yCount + 1 END AS xStart,
            CASE WHEN ra.yCount < 5 THEN NULL ELSE ra.x END AS xEnd,
            CASE
                WHEN ra.yCount < 5 THEN NULL
                WHEN SUM((ds.x - xMean) * (ds.x - xMean)) = 0
                THEN 0
                ELSE
                SUM((ds.x - xMean) * (ds.y - yMean)) / SUM((ds.x - xMean) * (ds.x - xMean))
            END AS slope
        FROM rollingAverages AS ra
        INNER JOIN dataset AS ds
            ON ra.x - ds.x BETWEEN 0 AND 4
        GROUP BY
            ra.x,
            ra.y,
            ra.xMean,
            ra.yMean,
            ra.yCount
    ),
    

    Results:

    enter image description here