I am querying a table containing billions of rows (BIG_TABLE) in Oracle. The table is partitioned on a monthly basis (DTE is the month of flux). The DBA wants us to filter our queries by month of flux, and warned that (i) not doing so may cause slower queries and (ii) it may impact other users, so he may kill non-filtered queries.

Since I need several months (sometimes up to 100), I use a parametrized query to pass the months I need (more below). I collect the results for all months in an intermediate table (INTERMEDIATE_TABLE, containing about 200 million rows per month) and then aggregate the data (FINAL_TABLE, used for the analyses). The sum must be done by CHR, whatever the month of flux.
-- query1
CREATE TABLE INTERMEDIATE_TABLE (
    CHR VARCHAR2(255),
    NBR NUMBER,
    DTE DATE
);
-- query2
INSERT INTO INTERMEDIATE_TABLE
SELECT
    CHR,
    NBR,
    DTE
FROM
    BIG_TABLE
WHERE
    DTE = TO_DATE(?, 'YYYY-MM-DD');
-- query3
CREATE TABLE FINAL_TABLE AS
SELECT
    CHR,
    SUM(NBR) AS NBR
FROM
    INTERMEDIATE_TABLE
GROUP BY
    CHR;
After saving the queries as strings (query1, query2, query3), I use R's DBI to run them. dbExecute() passes the dates one by one to the equality comparison in query2, so the query is executed three times (i.e. the list is not passed in one go).
library(DBI)
con <- dbConnect(odbc::odbc(), ...)
dbExecute(con, query1)
dbExecute(con, query2, params = list(c("2020-01-01", "2020-02-01", "2020-03-01")))
dbExecute(con, query3)
I would like INTERMEDIATE_TABLE to be temporary. Can I INSERT INTO a CTE?

I am open to alternative solutions, but I am looking for an ANSI solution because a move to PostgreSQL is planned. I would like to avoid a temporary table because in Oracle only the data is temporary, not the table. Of course I could just DROP the table afterwards, but it seems more elegant/explicit (and efficient?) to make it temporary in the first place.
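To illustrate what I mean, here is a minimal sketch of an Oracle global temporary table (reusing the con from above): the definition stays in the schema, only the rows are temporary.

# Oracle global temporary table: the table DEFINITION persists across
# sessions; only the ROWS are temporary. ON COMMIT PRESERVE ROWS keeps
# them for the whole session instead of a single transaction.
dbExecute(con, "
  CREATE GLOBAL TEMPORARY TABLE INTERMEDIATE_TABLE (
    CHR VARCHAR2(255),
    NBR NUMBER,
    DTE DATE
  ) ON COMMIT PRESERVE ROWS
")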
A reproducible example:
CREATE TABLE BIG_TABLE (
    CHR VARCHAR2(255),
    NBR NUMBER,
    DTE DATE
);
INSERT ALL
    INTO BIG_TABLE VALUES ('A', 2, DATE '2020-01-01')
    INTO BIG_TABLE VALUES ('B', 3, DATE '2020-01-01')
    INTO BIG_TABLE VALUES ('A', 1, DATE '2020-02-01')
    INTO BIG_TABLE VALUES ('B', 2, DATE '2020-02-01')
    INTO BIG_TABLE VALUES ('A', 3, DATE '2020-02-01')
    INTO BIG_TABLE VALUES ('B', 2, DATE '2020-03-01')
    INTO BIG_TABLE VALUES ('B', 4, DATE '2020-03-01')
    INTO BIG_TABLE VALUES ('C', 1, DATE '2020-03-01')
    INTO BIG_TABLE VALUES ('B', 4, DATE '2020-04-01')
    INTO BIG_TABLE VALUES ('D', 2, DATE '2020-05-01')
SELECT 1 FROM DUAL;
Desired output:
CHR  NBR
A    6
B    11
C    1
You don't need any kind of intermediate stage at all, not even a subquery.
CREATE TABLE FINAL_TABLE AS
SELECT CHR,
       SUM(NBR) NBR
FROM BIG_TABLE
WHERE DTE = ?
GROUP BY CHR;
Of course, with a billion rows, you want parallelism, which you may or may not be getting depending on how the table is configured. It might help to explicitly request it:
CREATE TABLE FINAL_TABLE PARALLEL (DEGREE 16) NOLOGGING AS
SELECT CHR,
       SUM(NBR) NBR
FROM BIG_TABLE
WHERE DTE = ?
GROUP BY CHR;
If you need multiple months, you have several options. I suggest a single query, written in a way that maximizes parallel thread distribution across partitions, assuming this is a partitioned table:
CREATE TABLE final_table PARALLEL (DEGREE 16) NOLOGGING AS
SELECT chr,
       SUM(nbr) nbr
FROM (SELECT chr,
             SUM(nbr) nbr
      FROM big_table
      WHERE dte = ?
      GROUP BY chr
      UNION ALL
      SELECT chr,
             SUM(nbr) nbr
      FROM big_table
      WHERE dte = ?
      GROUP BY chr
      UNION ALL
      SELECT chr,
             SUM(nbr) nbr
      FROM big_table
      WHERE dte = ?
      GROUP BY chr)
GROUP BY chr;
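Since the number of months varies, you can generate one UNION ALL branch per month from R and bind one date per placeholder. A minimal sketch, reusing the con from the question (and assuming the driver accepts bind parameters inside a CTAS; if not, interpolate the dates as literals instead):

months <- c("2020-01-01", "2020-02-01", "2020-03-01")

# One UNION ALL branch per month, each with its own '?' placeholder.
branch <- "SELECT chr, SUM(nbr) nbr FROM big_table
           WHERE dte = TO_DATE(?, 'YYYY-MM-DD') GROUP BY chr"
inner  <- paste(rep(branch, length(months)), collapse = " UNION ALL ")

query <- paste0(
  "CREATE TABLE final_table PARALLEL (DEGREE 16) NOLOGGING AS ",
  "SELECT chr, SUM(nbr) nbr FROM (", inner, ") GROUP BY chr"
)

# as.list() maps each date to its own placeholder in a single execution,
# unlike the length-3 vector in the question, which re-runs the statement
# once per value.
dbExecute(con, query, params = as.list(months))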
If you need to pull many months and have them all end up in this table while only pulling one month at a time, then precreate the table and insert each month:
CREATE TABLE final_table (dte DATE, chr VARCHAR2(18), nbr INTEGER); -- one time

INSERT /*+ append */ INTO final_table
SELECT /*+ parallel(16) */
       dte,
       chr,
       SUM(nbr) nbr
FROM big_table
WHERE dte = ?
GROUP BY dte,
         chr;

COMMIT;
-- repeat for every month.
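From R, that per-month loop could look like the following sketch (reusing con; insert_query is assumed to hold the INSERT statement above as a string):

months <- c("2020-01-01", "2020-02-01", "2020-03-01")
for (m in months) {
  # one month per execution; as.Date() lets the driver bind an Oracle DATE
  dbExecute(con, insert_query, params = list(as.Date(m)))
  # direct-path (APPEND) inserts require a commit before the same session
  # can touch the table again
  dbExecute(con, "COMMIT")
}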
Then at the end, if you need the whole thing aggregated across time, you can do a GROUP BY that excludes the date:
SELECT chr,
       SUM(nbr) nbr
FROM final_table
GROUP BY chr;
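And to pull those cross-month totals back into an R data frame (same assumed con):

totals <- dbGetQuery(con, "SELECT chr, SUM(nbr) nbr FROM final_table GROUP BY chr")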