Search code examples
sqlsql-serversql-server-2012query-optimization

SQL Server Query optimization : too many self inner join


I am currently trying to improve a SQL query on SQL Server.

My working table looks like this :

CAT_HISTORY

DATE        ID          CATEGORY
----------- ----------- -----------
20121201    A           1
20121201    A           1
20121201    B           1
20121201    C           2
20131201    A           2
20131201    B           4
20131201    C           3
20141201    A           3
20141201    B           2
20141201    B           2
20141201    C           1

And my objective is to retrieve the history of their category. So far I am doing this :

 SELECT   A.DATE
         ,COUNT(DISTINCT A.ID) AS NB_CLIENTS
         ,A.CATEGORY           AS STARTING_CAT
         ,B.CATOGORY           AS ENDING_CAT

FROM CAT_HISTORY A
INNER JOIN CAT_HISTORY B 
ON (
     A.ID= B.ID
 AND
 ( 
    ( 
          A.DATE = 20121201
      AND B.DATE = 20131201 
    )
  OR  
    (
          A.DATE  = 20131201
      AND B.DATE  = 20141201
    )

  WHERE A.DATE>= 20121201 AND B.DATE<= 20141201
  GROUP BY A.DATE, A.CATEGORY,B.CATEGORY
  ORDER BY A.DATE, A.CATEGORY,B.CATEGORY

And the result is :

DATE_KEY   STARTING_CAT ENDING_CAT     NB_CLIENTS 
-----------  -----------  -----------  -----------
20121201     1            2            1
20121201     1            4            1
20121201     2            3            1
20131201     2            3            1
20131201     4            2            1
20131201     2            3            1

But the thing is that I have way more dates, and I add an OR for every date (about 15 distinct dates), and I have loads of users. Which means the query sometimes takes up to 15min to get results.

I believe I am being brutal on my INNER JOINs, and that there probably is a more elegant and efficient way of getting the expected result.

My final objective is to get a Sankey to see the evolutions from one category to another through time, and I need the amount of users that moved from a category to another between to dates.

Using Gordon Linoff's answer, it worked nicely but is counting duplicates

SELECT DISTINCT DATE, CATEGORY,NEXT_CATEGORY, COUNT(*) AS NB_CLIENTS
FROM (  
        SELECT DISTINCT CH.*, LEAD(CATEGORY) OVER (PARTITION BY CH.ID ORDER BY DATE) AS NEXT_CATEGORY
        FROM CAT_HISTORY CH 
        ) CH
        WHERE  NEXT_CATEGORY IS NOT NULL
        GROUP BY DATE, CATEGORY,NEXT_CATEGORY

Example : Expected

DATE_KEY     STARTING_CAT ENDING_CAT   NB_CLIENTS 
-----------  -----------  -----------  -----------
20121201     1            2            1
20121201     1            4            1
20121201     2            3            1
20131201     2            3            1
20131201     4            2            1
20131201     2            3            1

With your solution :

DATE_KEY     STARTING_CAT ENDING_CAT   NB_CLIENTS 
-----------  -----------  -----------  -----------
20121201     1            1            1
20121201     1            2            1
20121201     1            4            1
20121201     2            3            1
20131201     2            3            1
20131201     4            2            1
20131201     2            3            1
20141201     2            2            1

Last edit :

I managed to find a workaround :

 SELECT DISTINCT DATE, CATEGORY,NEXT_CATEGORY, COUNT(*) AS NB_CLIENTS
    FROM (  
            SELECT DISTINCT CH.*, LEAD(CATEGORY) OVER (PARTITION BY CH.ID ORDER BY DATE) AS NEXT_CATEGORY
            FROM (SELECT DISTINCT * FROM CAT_HISTORY) CH 
            ) CH
            WHERE  NEXT_CATEGORY IS NOT NULL
            GROUP BY DATE, CATEGORY,NEXT_CATEGORY

Solution

  • If you want to see pair-wise changes, then use lead() rather than fixed dates. In SQL Server 2012+, you can do:

    select date, category, next_category, count(*)
    from (select ch.*,
                 lead(category) over (partition by id order by date) as next_category
          from cat_history ch
         ) ch
    group by date, category, next_category;
    

    In earlier versions of SQL Server, you can use similar logic with a correlated subquery or apply.