Tags: sql, sql-server, azure, query-optimization

How to optimize a SQL query that uses a CTE and GROUP BY and runs for 20+ hours?


I have a query that first performs joins inside a CTE, then joins the CTE with another table, performs a GROUP BY with some aggregations, and returns the top 15 records.

WITH join_table as (
    SELECT pl.yearmonth, 
    pl.pdid, 
    pl.OrderId,
    pl.OrderNo AS OrderNumber, 
    pl.pmc,
    pl.pmcode AS pmaa,
    pl.spid,
    pl.issuingdate,
    pl.plguid, 
    pusa.SizeCode,
    pusa.atnumber,
    pu.tpc, 
    pu.ItemQty AS puItemQty
    FROM table1 pl 
    JOIN table3 pusa ON pl.plguid=pusa.plguid
    JOIN table2 pu ON pusa.plguid=pu.plguid AND pusa.puguid=pu.puguid
    WHERE pl.pmode='BBB'
      and pl.pltype='CCC'
      and pl.plstatus='AAA'
      and pu.ctcode='S'
      AND pu.ctype='DDD'
      AND pu.tpc <> 'EEE' 
),
get_top_15PM as (
    SELECT TOP 15 pl.pmc, sum(cast(puItemQty as BIGINT)) AS SumpuItemQty
    FROM join_table pl
    join abctable b on b.pdid= pl.pdid and b.spid= pl.spid
    WHERE pl.issuingdate > '2021-8-1'
    and prodtypeid in (select prodtypeid from prodtype where prodgrpid in (3,4,5,7,8,9,10,13,20))
    group by pl.pmc
    ORDER BY SumpuItemQty DESC
)
SELECT DISTINCT 
    pl.yearmonth, 
    pl.pdid, 
    pl.OrderId,
    pl.OrderNumber, 
    pl.pmc,
    pl.pmaa,
    pl.spid,
    pl.issuingdate, 
    pl.atnumber,
    pl.tpc, 
    pl.puItemQty,
    s.SizeName AS Size,
    s.SizeLength,
    s.SizeWidth,
    s.SizeHeight,
    s.SizeVolume,
    s.SizeWeight
FROM join_table pl 
JOIN PLSize s ON pl.plguid=s.plguid AND pl.SizeCode=s.SizeCode
WHERE  pl.pmc IN ( SELECT pmc from get_top_15PM) 
      AND pl.yearmonth>=202100

This query is taking 20+ hours to run on Azure SQL Database.

Details:

Azure database pricing tier: S6 with 750GB storage with 20% unused storage space left

TableName   rows         TotalSpaceGB   UsedSpaceGB UnusedSpaceGB
table2      332,318,173  117.72         117.71      0.01
table3      153,700,352  60.78          60.76       0.01
table1      15,339,815   13.21          13.20       0.01
abctable    1,232,868    0.81           0.80        0.00

Estimated execution plan with query

Actual execution plan with query

Wait type: (14ms)PAGEIOLATCH_SH:dev-db:1(*), or sometimes NULL. Captured using sp_WhoIsActive.

Note:

The actual execution plan was captured from the running query as shown in this SO answer

These tables have a clustered index on the primary key and nonclustered indexes on other fields that I am not using in filters, joins, or ORDER BY.

table1 is the parent of table2, and table2 is the parent of table3

Any advice or suggestions are highly appreciated.

Update-1

--table1
index_name          index_description                 index_keys
idx_table1_pmcode   nonclustered                      pmcode
PK_table1           clustered, unique, primary key    plguid

--table2
index_name          index_description                 index_keys
IX_table2_plguid    nonclustered                      plguid
PK_table2           clustered, unique, primary key    puguid

--table3
index_name          index_description                 index_keys
IX_table3_plguid    nonclustered                      plguid
PK_table3           clustered, unique, primary key    pusguid

--abctable
index_name             index_description               index_keys
nci_wi_abctable_2BC1D  nonclustered                    dpyearmonth
nci_wi_abctable_FAA89  nonclustered                    dpnumber, ptid
PK_abctable            clustered, unique, primary key  OrderId

DDL

--table1
SET ANSI_PADDING ON
GO

CREATE NONCLUSTERED INDEX [idx_table1_pmcode] ON [dbo].[table1]
(
    [pmcode] ASC
)
INCLUDE([OrderID],[OrderNo],[yearmonth],[pdid]) WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

--table2
SET ANSI_PADDING ON
GO

CREATE NONCLUSTERED INDEX [IX_table2_plguid] ON [dbo].[table2]
(
    [plguid] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

--table3
SET ANSI_PADDING ON
GO

CREATE NONCLUSTERED INDEX [IX_table3_plguid] ON [dbo].[table3]
(
    [plguid] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

--abctable
SET ANSI_PADDING ON
GO


CREATE NONCLUSTERED INDEX [nci_wi_abctable_2BC1D] ON [dbo].[abctable]
(
    [dpyearmonth] ASC
)
INCLUDE([pduid],[pdid]) WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [nci_wi_abctable_FAA89] ON [dbo].[abctable]
(
    [dpyearmonth] ASC,
    [ptid] ASC
)
INCLUDE([OrderStatus],[spid]) WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

Solution

  • Keep in mind that SQL is a declarative language: you tell it what you want, and the query engine figures out how to get it. By contrast, Python and other procedural languages are told how to get what you want. This is relevant because SQL Server's query planner shuffles around the terms in your CTEs and main query; it doesn't usually execute the CTEs as step-by-step procedures for producing your results.

    The server's query planner uses indexes wherever they are helpful to random-access your table data. SQL Server indexes are usually B-TREE indexes. You can think of a B-TREE index as if it were an old-school telephone directory book. To find me (O.Jones), for example, you'll open up the book to the letter J. Then you'll look for the first "Jones" in the surnames. Then you'll look for the first "O" in the given-name column. Then you'll scan through the "O"s to try and figure out which one is me.

    Indexes work the same way, and random-accessing them to the first interesting entry is a very fast O(log(n)) operation. Then scanning the entries one by one from there is O(1). So you want to use an index. Scanning through the book looking at every entry would take an absurd amount of time. Like 20 hours or more. The point of an index is to reduce the number of entries we must scan through one by one.
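    A quick way to see whether the engine is seeking or scanning on that table is to compare logical reads before and after adding an index. A minimal sketch using the question's own table1 columns (the counts appear on the Messages tab in SSMS; this is a sanity check, not a full plan analysis):

    ```sql
    -- Report page-read counts for each statement in this session.
    SET STATISTICS IO ON;

    -- The filters from the CTE; with a suitable index, logical reads
    -- should drop sharply compared to a full clustered-index scan.
    SELECT COUNT(*)
    FROM table1 pl
    WHERE pl.pmode = 'BBB'
      AND pl.pltype = 'CCC'
      AND pl.plstatus = 'AAA'
      AND pl.issuingdate > '20210801';

    SET STATISTICS IO OFF;
    ```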

    For this to work, you need a multicolumn index appropriate to your query. Let's look at your table1 (about 15 million rows, and the driver of the joins into your third-of-a-gigarow table2). Various parts of your large query apply WHERE filters to these columns. (This list may be incomplete; your query is large.)

    Column       Expression                    Type of expression
    pmode        pl.pmode = 'BBB'              Equality
    pltype       pl.pltype = 'CCC'             Equality
    plstatus     pl.plstatus = 'AAA'           Equality
    issuingdate  pl.issuingdate > '2021-8-1'   Open-ended range scan
    yearmonth    pl.yearmonth >= 202100        Open-ended range scan

    So, we have three equality matches and two range scans here, which is a reasonably good situation for a B-TREE index. This index might help.

    CREATE INDEX pmode_pltype_plstatus_issuingdate
              ON table1(pmode, pltype, plstatus, issuingdate);
    

    Edit A multi-column B-TREE index is most useful when it contains zero or more columns for equality matching, and then just one column for range scanning. This query calls for two range scans. I chose issuingdate for the index's single range scan column, because I guess it will be more selective -- match fewer rows -- than yearmonth. But that's a guess on my part.
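    For completeness, this is what the alternative choice would look like, with yearmonth in the single range-scan position instead. Only one of the two variants should be kept; comparing the actual execution plans for both is the way to decide.

    ```sql
    -- Alternative: same three equality columns, but yearmonth as the
    -- range-scan column. Keep whichever variant the actual plan favors.
    CREATE INDEX pmode_pltype_plstatus_yearmonth
              ON table1(pmode, pltype, plstatus, yearmonth);
    ```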

    The query planner will random-access this index to the first matching row of that big table, then scan the index until it gets to the last matching row, then stop. That's faster than looking through the whole table.

    Also, your actual execution plan for the running query recommends creating this index.

    CREATE INDEX plguid_sizecode 
              ON table3(plguid, SizeCode)
         INCLUDE (puguid, atnumber);
    

    That index will help accelerate two different JOINs, the last one in the query and the first one in the first CTE.

    The indexes recommended by SQL Server actual execution plans are generally useful and worth paying attention to.
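    The same recommendations that the graphical plan surfaces can also be read server-side from the missing-index DMVs, which is handy when you don't have the plan in front of you. A sketch (these views reset when the instance restarts, so treat the numbers as indicative):

    ```sql
    -- Top missing-index suggestions, ranked by the optimizer's
    -- estimate of how much each would help.
    SELECT TOP 10
           d.statement AS table_name,
           d.equality_columns,
           d.inequality_columns,
           d.included_columns,
           s.avg_user_impact
    FROM sys.dm_db_missing_index_details d
    JOIN sys.dm_db_missing_index_groups g
      ON g.index_handle = d.index_handle
    JOIN sys.dm_db_missing_index_group_stats s
      ON s.group_handle = g.index_group_handle
    ORDER BY s.avg_user_impact DESC;
    ```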

    Pro tip Read Markus Winand's e-book https://use-the-index-luke.com/ to learn a lot about this topic.

    Pro tip Be careful with datestamp inequalities. pl.issuingdate > '2021-8-1' removes rows occurring exactly at midnight on 1-Aug-2021, but takes all the rows from that day after midnight. It seems unlikely that's what you want.
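    If the whole of 1-Aug-2021 is meant to be included, a >= comparison is the usual fix, and an unambiguous literal avoids a second trap (assuming issuingdate is a datetime-family column):

    ```sql
    -- Takes every row from midnight on 1-Aug-2021 onward, midnight included.
    -- The yyyymmdd form is parsed the same way regardless of the session's
    -- language and DATEFORMAT settings, unlike '2021-8-1'.
    WHERE pl.issuingdate >= '20210801'
    ```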