Tags: azure-data-lake, u-sql

U-SQL: Nesting TVFs and queries is giving horrendous results


I think this problem relates to the query optimization that Azure Data Lake Analytics does, but let's see...

I have two separate queries (TVFs) doing aggregations, and then a final query to join the two together for the final result. So...

Table >  Header Query
Table >  Detail Query
Result = Header Query + Detail Query

To test the whole logic out, I ran the two sub-queries separately with a filter, stored the results to files, and then used those flat files as sources for the final query. These are the total durations (minutes):

Header Query  1.4  (408 rows)
Detail Query  0.9  (3298 rows)
Final Query   0.9  (408 rows)
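
For context, the split version looked roughly like this. The intermediate path, the column types in the EXTRACT, and the Csv parameters are illustrative assumptions on my part; only the columns the final query needs are shown:

// Job 1 (sketch): run the Header TVF and materialise its output to a flat file.
@Header =
    SELECT [CTNNumber], [CTNCycleNo], [SeqStart], [SeqEnd]
    FROM [Play].[getCycles3]("") AS X;

OUTPUT @Header
TO "/Play/Intermediate/Header.csv"
USING Outputters.Csv(outputHeader: true);

// Job 2 (sketch): read the flat file back as the source for the final query.
@Header =
    EXTRACT [CTNNumber] string,
            [CTNCycleNo] int,
            [SeqStart] int,
            [SeqEnd] int
    FROM "/Play/Intermediate/Header.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

The same pattern applies to the Detail TVF.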

So I know that, as a maximum, I can get my result in around 3.2 minutes. However, I don't really want to create new intermediary files; I want to use the TVFs directly to feed the final query.

With the TVFs in the final query, the job graph gets to around 97% progress within about 1.5 minutes. But then all hell breaks loose! The last node is an Aggregate with 2,500 vertices that reports 16 minutes of compute time. So my question: why??

Is this a case of me not understanding some fundamental concepts of how Azure works?

So, can anyone explain what's going on? Any help appreciated.

Final Query:

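// One row per container cycle, carrying the cycle's sequence range ([SeqStart]..[SeqEnd]).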
@Header =
SELECT [CTNNumber],
       [CTNCycleNo],
       [SeqStart],
       [SeqEnd],
       [StartUTC],
       [EndUTC],
       [StartLoc],
       [StartType],
       [EndLoc],
       [EndType],
       [Start Step],
       [Start Ctn Status],
       [Start Fill Status],
       [EndStep],
       [End Ctn Status],
       [End Fill Status]
FROM [Play].[getCycles3]
     ("") AS X;


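// Raw detail events, one row per [SeqNo] for each container.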
@Detail =
SELECT [CTNNumber],
       [SeqNo],
       [LocationType],
       [LocationID],
       [BizstepDescription],
       [ContainerStatus],
       [FillStatus],
       [UTCTimeStampforEvent]
FROM [Play].[getRaw]
     ("") AS Z;

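// For each cycle, count the detail rows whose [SeqNo] falls inside that cycle's range.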
@result =
    SELECT
        H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]
        ,COUNT([D].[SeqNo]) AS [SeqCount]
        //, COUNT(DISTINCT [LocationID]) AS [#Locations]
    FROM 
        @Header AS [H]
        INNER JOIN
        @Detail AS [D]
        ON 
        [H].[CTNNumber] == [D].[CTNNumber] 
    WHERE 
        [D].[SeqNo] >= [H].[SeqStart] AND
        [D].[SeqNo] <= [H].[SeqEnd]  
    GROUP BY 
        H.[CTNNumber], H.[CTNCycleNo], H.[SeqStart], H.[SeqEnd]
    ;

[Screenshot: job graph showing the final Aggregate node with 2,500 vertices]


Solution

  • So I raised this as a ticket with Microsoft. Here is their response, which I implemented successfully.

    From: #######@microsoft.com
    Subject: RE: ########### USQL Job displays bizarre job plan and runtime

    So the issue here is cardinality. When the script gets broken into parts the U-SQL compiler has a new input to work with at each intermediate step, and it knows the actual data size and distribution of that new input, so the optimizer is able to accurately allocate resources.

    However, when the script is put all into one, the optimizer's estimation is very different, because it doesn't know the exact size and distribution of those intermediate results. So we expect there to be some difference in estimations between a script that has been broken into parts and a script that is all in one.

    Your best defense against optimization differences like these is to CREATE STATISTICS on table columns. In this case you should CREATE STATISTICS on the CTNNumber column and see if that improves performance on the single job script.

    Here is some documentation surrounding CREATE STATISTICS: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-statistics-transact-sql
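
Following their advice, here is a minimal sketch of the fix. The table and statistics names are placeholders (the managed table behind the TVFs isn't shown above), and statistics can only be created on U-SQL managed tables, so substitute whichever table [Play].[getCycles3] and [Play].[getRaw] actually read from:

// Hypothetical sketch: create statistics on the join/grouping column.
// [Play].[SourceTable] is a placeholder for the managed table behind the TVFs.
CREATE STATISTICS IF NOT EXISTS [stats_CTNNumber]
ON [Play].[SourceTable] ([CTNNumber])
WITH FULLSCAN;

Once the statistics object exists, the optimizer has real cardinality information for [CTNNumber], which is exactly the column the final query joins and groups on.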