I am currently trying to improve a SQL query on SQL Server.
My working table looks like this :
CAT_HISTORY
DATE ID CATEGORY
----------- ----------- -----------
20121201 A 1
20121201 A 1
20121201 B 1
20121201 C 2
20131201 A 2
20131201 B 4
20131201 C 3
20141201 A 3
20141201 B 2
20141201 B 2
20141201 C 1
And my objective is to retrieve the history of their category. So far I am doing this :
SELECT A.DATE
,COUNT(DISTINCT A.ID) AS NB_CLIENTS
,A.CATEGORY AS STARTING_CAT
,B.CATOGORY AS ENDING_CAT
FROM CAT_HISTORY A
INNER JOIN CAT_HISTORY B
ON (
A.ID= B.ID
AND
(
(
A.DATE = 20121201
AND B.DATE = 20131201
)
OR
(
A.DATE = 20131201
AND B.DATE = 20141201
)
WHERE A.DATE>= 20121201 AND B.DATE<= 20141201
GROUP BY A.DATE, A.CATEGORY,B.CATEGORY
ORDER BY A.DATE, A.CATEGORY,B.CATEGORY
And the result is :
DATE_KEY STARTING_CAT ENDING_CAT NB_CLIENTS
----------- ----------- ----------- -----------
20121201 1 2 1
20121201 1 4 1
20121201 2 3 1
20131201 2 3 1
20131201 4 2 1
20131201 2 3 1
But the thing is that I have way more dates, and I add an OR for every date (about 15 distinct dates), and I have loads of users. Which means the query sometimes takes up to 15min to get results.
I believe I am being brutal on my INNER JOINs, and that there probably is a more elegant and efficient way of getting the expected result.
My final objective is to get a Sankey to see the evolutions from one category to another through time, and I need the amount of users that moved from a category to another between to dates.
Using Gordon Linoff's answer, it worked nicely but is counting duplicates
SELECT DISTINCT DATE, CATEGORY,NEXT_CATEGORY, COUNT(*) AS NB_CLIENTS
FROM (
SELECT DISTINCT CH.*, LEAD(CATEGORY) OVER (PARTITION BY CH.ID ORDER BY DATE) AS NEXT_CATEGORY
FROM CAT_HISTORY CH
) CH
WHERE NEXT_CATEGORY IS NOT NULL
GROUP BY DATE, CATEGORY,NEXT_CATEGORY
Example : Expected
DATE_KEY STARTING_CAT ENDING_CAT NB_CLIENTS
----------- ----------- ----------- -----------
20121201 1 2 1
20121201 1 4 1
20121201 2 3 1
20131201 2 3 1
20131201 4 2 1
20131201 2 3 1
With your solution :
DATE_KEY STARTING_CAT ENDING_CAT NB_CLIENTS
----------- ----------- ----------- -----------
20121201 1 1 1
20121201 1 2 1
20121201 1 4 1
20121201 2 3 1
20131201 2 3 1
20131201 4 2 1
20131201 2 3 1
20141201 2 2 1
Last edit :
I managed to find a workaround :
SELECT DISTINCT DATE, CATEGORY,NEXT_CATEGORY, COUNT(*) AS NB_CLIENTS
FROM (
SELECT DISTINCT CH.*, LEAD(CATEGORY) OVER (PARTITION BY CH.ID ORDER BY DATE) AS NEXT_CATEGORY
FROM (SELECT DISTINCT * FROM CAT_HISTORY) CH
) CH
WHERE NEXT_CATEGORY IS NOT NULL
GROUP BY DATE, CATEGORY,NEXT_CATEGORY
If you want to see pair-wise changes, then use lead()
rather than fixed dates. In SQL Server 2012+, you can do:
select date, category, next_category, count(*)
from (select ch.*,
lead(category) over (partition by id order by date) as next_category
from cat_history ch
) ch
group by date, category, next_category;
In earlier versions of SQL Server, you can use similar logic with a correlated subquery or apply
.