I'm trying to find the rolling line of best fit for a set of data, when we look at groups of five points at a time, ordered by the x value. In other words:
Here's the values I'm aiming for, in an Excel sheet and plot. The values for slope
and yIntercept
are correct according to pen-and-paper and online linear regression calculation:
...and here's the SQL I have so far:
WITH dataset AS (
SELECT 1 AS x, 9 AS y UNION ALL
SELECT 2 AS x, 7 AS y UNION ALL
SELECT 3 AS x, 5 AS y UNION ALL
SELECT 4 AS x, 3 AS y UNION ALL
SELECT 5 AS x, 1 AS y UNION ALL
SELECT 6 AS x, 1 AS y UNION ALL
SELECT 7 AS x, 1 AS y UNION ALL
SELECT 8 AS x, 1 AS y UNION ALL
SELECT 9 AS x, 1 AS y
),
rollingAverages AS (
SELECT
dataset.*,
AVG(dataset.x * 1.00) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS [xMean],
AVG(dataset.y * 1.00) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS [yMean],
SUM(1) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS [yCount]
FROM dataset
),
mValue AS (
SELECT
*,
CASE WHEN yCount < 5 THEN NULL ELSE x - yCount + 1 END AS xStart,
CASE WHEN yCount < 5 THEN NULL ELSE x END AS xEnd,
CASE
WHEN yCount < 5 THEN NULL
WHEN SUM((x - xMean) * (x - xMean)) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) = 0
THEN 0
ELSE
SUM((x - xMean) * (y - yMean)) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
/ SUM((x - xMean) * (x - xMean)) OVER (ORDER BY x ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
END AS slope
FROM rollingAverages
),
-- This is the y intercept at the start of the range, i.e. 40 trading days before "today"
yIntercept AS (
SELECT
*,
yMean - slope * xMean AS yIntercept
FROM mValue
),
channelNowMidpoint AS (
SELECT
*
FROM yIntercept
)
SELECT *
FROM channelNowMidpoint
ORDER BY x
I'm not getting the correct values for slope
or yIntercept
, I think because the line-of-best-fit algorithm I'm using expects an unbounded set of values, so the calculations I get for xMean
and yMean
have lost context by the time I get to the CTE named mValue
. For reference, you can find a line-of-best-fit algorithm that uses the "least squares" method here.
See below for the values I'm getting when I run this SQL in SSMS:
As you can see, where x = 5
, the slope
or yIntercept
are correct, but after that it is incorrect. I'm not sure where I'm going wrong and how to get the values I'm aiming for.
Ok, I've figured it out. The problem with using window functions in this case is we're getting the values of xMean
and yMean
at the row being processed, not at the lead row of the window.
In order to fix this, the mValue
CTE needs to join back to the dataset
CTE to get the values for x
and y
then stop using window functions, so that the xMean
and yMean
values are static for each row:
mValue AS (
SELECT
ra.*,
CASE WHEN ra.yCount < 5 THEN NULL ELSE ra.x - ra.yCount + 1 END AS xStart,
CASE WHEN ra.yCount < 5 THEN NULL ELSE ra.x END AS xEnd,
CASE
WHEN ra.yCount < 5 THEN NULL
WHEN SUM((ds.x - xMean) * (ds.x - xMean)) = 0
THEN 0
ELSE
SUM((ds.x - xMean) * (ds.y - yMean)) / SUM((ds.x - xMean) * (ds.x - xMean))
END AS slope
FROM rollingAverages AS ra
INNER JOIN dataset AS ds
ON ra.x - ds.x BETWEEN 0 AND 4
GROUP BY
ra.x,
ra.y,
ra.xMean,
ra.yMean,
ra.yCount
),
Results: