Search code examples
sqlteradatasql-optimization

OLAP Function Processing - Why is it faster to run on N/M partitions M times than N records 1 time


I have a (very large) table like this

CREATE SET TABLE LOAN 
  ( LoanNumber VARCHAR(100),
    LoanBalance DECIMAL(18,4),
    RecTimeStamp TIMESTAMP(0)
  )
PRIMARY INDEX (LoanNumber)
PARTITION BY RANGE_N
  ( ROW_INS_TS BETWEEN 
        TIMESTAMP '2017-01-01 00:00:00+00:00' 
    AND TIMESTAMP '2017-12-31 23:59:59+00:00' 
    EACH INTERVAL '1' DAY 
  );

This table typically gets rolled up by snapshots, for example the April month end snapshot would be

-- Pretend there is just 2017 data there
CREATE SET TABLE LOAN_APRIL AS 
  ( SELECT * 
      FROM LOAN
     WHERE RecTimeStamp <= DATE '2017-04-30'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1
  )
PRIMARY INDEX (LoanNumber);

Which typically takes quite awhile to run. I was experimenting yesterday though and found that I had very good execution time by breaking it apart like this

CREATE SET TABLE LOAN_APRIL_TMP
  ( LoanNumber VARCHAR(100),
    LoanBalance DECIMAL(18,4),
    RecTimeStamp TIMESTAMP(0)
  )
PRIMARY INDEX (LoanNumber);

CREATE SET TABLE LOAN_APRIL
  ( LoanNumber VARCHAR(100),
    LoanBalance DECIMAL(18,4),
    RecTimeStamp TIMESTAMP(0)
  )
PRIMARY INDEX (LoanNumber);

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-01-01' AND DATE '2017-01-31'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-02-01' AND DATE '2017-02-28'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-03-01' AND DATE '2017-03-31'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-04-01' AND DATE '2017-04-30'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL
    SELECT * 
      FROM LOAN_APRIL_TMP
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

I just ran this sequentially, so they didn't execute in parallel. Today I'm going to experiment to see about getting each segment to load in parallel.

Also, to a larger point, I have trouble finding sufficiently technical documentation to figure these types of questions out. Is there a good resource for this? I understand there are a lot of propriety concerns, but there has to be something which describes, at least at a high level, the implementation of these functions.


Solution

  • There might be multiple reasons. You should check DBQL to see the actual resource usage differences.

    • The data in the 1st Select is scattered across more partitions than those smaller Selects.

    • Explain might show spool will not be chached in memory for the big Select, but not for the seperate ones.

    • VarChars in order by are expanded to Chars of the defined size, if LoanNumber is actually a VarChar(100) (I doubt it is) it will increase spool, too (but this is a common problem for other queries against that table).

    There's one disadvantage of OLAP functions, they need two spools, i.e. double the spool size. If this table has lots of column/large rows it might be much more efficient to run the ROW_NUMBER only against the PK of the table and then join back like this:

    CREATE SET TABLE LOAN_APRIL_TMP
      ( LoanNumber VARCHAR(100),
        RecTimeStamp TIMESTAMP(0)
      )
    PRIMARY INDEX (LoanNumber) -- same PPI as source table to facilitate fast join back
    PARTITION BY RANGE_N
      ( ROW_INS_TS BETWEEN 
            TIMESTAMP '2017-01-01 00:00:00+00:00' 
        AND TIMESTAMP '2017-12-31 23:59:59+00:00' 
        EACH INTERVAL '1' DAY 
      );
    
    INSERT INTO LOAN_APRIL_TMP
    SELECT LoanNumber, RecTimeStamp -- no other columns
    FROM LOAN
    WHERE RecTimeStamp <= DATE '2017-04-30'
    QUALIFY row_number() OVER
                 ( PARTITION BY LoanNumber 
                       ORDER BY RecTimeStamp DESC
                 ) = 1
    ;
    
    INSERT INTO LOAN_APRIL
    SELECT l.* -- now get all columns
    FROM LOAN AS l
    JOIN LOAN_APRIL_TMP AS AS tmp
      ON l.LoanNumber = tmp.LoanNumber
     AND l.RecTimeStamp = tmp.RecTimeStamp