I have a (very large) table like this
CREATE SET TABLE LOAN
( LoanNumber VARCHAR(100),
LoanBalance DECIMAL(18,4),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber)
PARTITION BY RANGE_N
( RecTimeStamp BETWEEN
TIMESTAMP '2017-01-01 00:00:00'
AND TIMESTAMP '2017-12-31 23:59:59'
EACH INTERVAL '1' DAY
);
This table typically gets rolled up into snapshots; for example, the April month-end snapshot would be
-- Pretend there is just 2017 data there
CREATE SET TABLE LOAN_APRIL AS
( SELECT *
FROM LOAN
WHERE RecTimeStamp <= DATE '2017-04-30'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1
)
WITH DATA
PRIMARY INDEX (LoanNumber);
This typically takes quite a while to run. I was experimenting yesterday, though, and found I got very good execution times by breaking it apart like this
CREATE SET TABLE LOAN_APRIL_TMP
( LoanNumber VARCHAR(100),
LoanBalance DECIMAL(18,4),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber);
CREATE SET TABLE LOAN_APRIL
( LoanNumber VARCHAR(100),
LoanBalance DECIMAL(18,4),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber);
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-01-01' AND DATE '2017-01-31'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-02-01' AND DATE '2017-02-28'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-03-01' AND DATE '2017-03-31'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL_TMP
SELECT *
FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-04-01' AND DATE '2017-04-30'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
INSERT INTO LOAN_APRIL
SELECT *
FROM LOAN_APRIL_TMP
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1;
I just ran these statements sequentially, so they didn't execute in parallel. Today I'm going to experiment with getting each segment to load in parallel.
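One option I plan to try (just a sketch, and it assumes I submit the work through BTEQ) is to turn the four monthly INSERTs into a single multi-statement request by placing each terminating semicolon at the start of the next statement; whether the steps actually overlap will depend on the plan and locking.
-- Sketch (assumes BTEQ): leading semicolons keep all four INSERTs in one
-- multi-statement request, giving Teradata the chance to process them as a
-- single parallel request rather than four serial ones.
INSERT INTO LOAN_APRIL_TMP
SELECT * FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-01-01' AND DATE '2017-01-31'
QUALIFY row_number() OVER (PARTITION BY LoanNumber ORDER BY RecTimeStamp DESC) = 1
;INSERT INTO LOAN_APRIL_TMP
SELECT * FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-02-01' AND DATE '2017-02-28'
QUALIFY row_number() OVER (PARTITION BY LoanNumber ORDER BY RecTimeStamp DESC) = 1
;INSERT INTO LOAN_APRIL_TMP
SELECT * FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-03-01' AND DATE '2017-03-31'
QUALIFY row_number() OVER (PARTITION BY LoanNumber ORDER BY RecTimeStamp DESC) = 1
;INSERT INTO LOAN_APRIL_TMP
SELECT * FROM LOAN
WHERE RecTimeStamp BETWEEN DATE '2017-04-01' AND DATE '2017-04-30'
QUALIFY row_number() OVER (PARTITION BY LoanNumber ORDER BY RecTimeStamp DESC) = 1;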
Also, to a larger point, I have trouble finding sufficiently technical documentation to figure these types of questions out. Is there a good resource for this? I understand there are a lot of proprietary concerns, but there has to be something that describes, at least at a high level, how these functions are implemented.
There might be multiple reasons. You should check DBQL to see the actual resource usage differences.
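For example, assuming query logging is enabled and you can read DBC.QryLogV (view and column names can differ slightly by release), a rough sketch to compare CPU, I/O and spool between the single big INSERT/SELECT and the monthly ones:
-- Sketch only: compare resource usage of the competing approaches in DBQL.
SELECT StartTime,
       AMPCPUTime,
       TotalIOCount,
       SpoolUsage,
       QueryText
FROM DBC.QryLogV
WHERE UserName = USER
  AND QueryText LIKE '%LOAN_APRIL%'
ORDER BY StartTime;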
The data for the 1st Select is scattered across more partitions than for the smaller Selects.
Explain might show that spool will not be cached in memory for the big Select, but will be for the separate ones.
VarChars in an ORDER BY are expanded to Chars of the defined size; if LoanNumber is really a VarChar(100) (I doubt the values need that many characters), it will increase spool, too (but this is a common problem for other queries against that table).
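To see whether the column is oversized, a quick (full-table-scan) check like this sketch shows how long the values really are; if the maximum is far below 100, a smaller VarChar would shrink the expanded width in those spools:
-- Sketch: actual vs. declared length of LoanNumber.
SELECT MAX(CHAR_LENGTH(LoanNumber)) AS MaxLen,
       AVG(CAST(CHAR_LENGTH(LoanNumber) AS FLOAT)) AS AvgLen
FROM LOAN;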
There's one disadvantage of OLAP functions: they need two spools, i.e. double the spool size. If this table has lots of columns/large rows, it might be much more efficient to run the ROW_NUMBER only against the PK of the table and then join back, like this:
CREATE SET TABLE LOAN_APRIL_TMP
( LoanNumber VARCHAR(100),
RecTimeStamp TIMESTAMP(0)
)
PRIMARY INDEX (LoanNumber) -- same PPI as source table to facilitate fast join back
PARTITION BY RANGE_N
( RecTimeStamp BETWEEN
TIMESTAMP '2017-01-01 00:00:00'
AND TIMESTAMP '2017-12-31 23:59:59'
EACH INTERVAL '1' DAY
);
INSERT INTO LOAN_APRIL_TMP
SELECT LoanNumber, RecTimeStamp -- no other columns
FROM LOAN
WHERE RecTimeStamp <= DATE '2017-04-30'
QUALIFY row_number() OVER
( PARTITION BY LoanNumber
ORDER BY RecTimeStamp DESC
) = 1
;
INSERT INTO LOAN_APRIL
SELECT l.* -- now get all columns
FROM LOAN AS l
JOIN LOAN_APRIL_TMP AS tmp
ON l.LoanNumber = tmp.LoanNumber
AND l.RecTimeStamp = tmp.RecTimeStamp;