I have a table with about 20 partitions, each holding roughly 190 million records. I need to gather statistics on the table periodically during my processing, which I do using the DBMS_STATS.GATHER_TABLE_STATS command. When the table had only one partition, this took about 4 minutes to complete. As the number of partitions has grown over time, the time it takes to gather stats has grown as well. I then switched to gathering stats on just one partition by adding a PARTNAME parameter to the GATHER_TABLE_STATS command, but the run time hasn't dropped. I have even created a new partition with only 1,000 rows in it, and gathering stats on that partition alone still takes between 22 and 25 minutes.

I looked in the USER_TAB_PARTITIONS table, and I see that the LAST_ANALYZED column is only being updated for the partition I specified in GATHER_TABLE_STATS, so I believe the stats are only being gathered on my single partition. Why does it take so long?

If it helps, this is my DDL. Note that I'm creating a LOCAL index, and there are no other tables that have foreign key references into this table.
CREATE TABLE LAR_ALLOCATION_PER_PART (
    PROC_MONTH          DATE NOT NULL,
    COUNTRY_CODE        VARCHAR2(2) NOT NULL,
    PART_NUMBER         VARCHAR2(20),
    CUSTOMER_CODE       VARCHAR2(32),
    LAR_ID              NUMBER NOT NULL,
    GROSS_SALES_AMOUNT  NUMBER,
    ALLOCATION_AMOUNT   NUMBER,
    WARRANTY_AMOUNT     NUMBER,
    CURRENCY_CODE       VARCHAR2(5),
    CONSTRAINT LAR_ALLOC_PP_COUNTRY_CODE_FK FOREIGN KEY (COUNTRY_CODE)
        REFERENCES SUPPORTED_COUNTRY (COUNTRY_CODE),
    CONSTRAINT LAR_ALLOC_PP_PART_NUM_FK FOREIGN KEY (PART_NUMBER)
        REFERENCES PART_CLASSIFICATION (ODS_PART_NUMBER),
    CONSTRAINT LAR_ALLOC_PP_LAR_ID_FK FOREIGN KEY (LAR_ID)
        REFERENCES LEDGER_ALLOCATION_RULE (ID)
)
PARTITION BY RANGE (PROC_MONTH)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION prior2017 VALUES LESS THAN (TO_DATE('01-JAN-2017', 'DD-MON-YYYY'))
);

CREATE INDEX LAR_ALLOCATION_PER_PART_IDX
    ON LAR_ALLOCATION_PER_PART (COUNTRY_CODE, LAR_ID, CUSTOMER_CODE, PART_NUMBER)
    LOCAL;
And this is the command I use to gather statistics:
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(OWNNAME  => 'MY_SCHEMA',
                                  TABNAME  => 'LAR_ALLOCATION_PER_PART',
                                  PARTNAME => 'SYS_P40553', -- Jan 2020: 1,000 records
                                  OPTIONS  => 'GATHER AUTO',
                                  DEGREE   => DBMS_STATS.DEFAULT_DEGREE,
                                  CASCADE  => TRUE);
END;
I've tried setting CASCADE to FALSE, raising the DEGREE to 32, even setting an ESTIMATE_PERCENT value of 10, but nothing has a significant impact on run time.

I just saw an answer about using incremental statistics, and I'm going to try that out, but I'd like to understand why stats gathering on a single partition takes so long, and whether I'm doing something incorrectly.
You might want to look at the GRANULARITY parameter of DBMS_STATS.GATHER_TABLE_STATS. By default, statistics are gathered for the partition as well as for the global table. Changing the value to APPROX_GLOBAL AND PARTITION may avoid re-gathering statistics for the global table.
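For example, a sketch of your call with the granularity changed (keep whatever other parameters you need; the partition name is the one from your example):

```sql
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(OWNNAME     => 'MY_SCHEMA',
                                  TABNAME     => 'LAR_ALLOCATION_PER_PART',
                                  PARTNAME    => 'SYS_P40553',
                                  GRANULARITY => 'APPROX_GLOBAL AND PARTITION',
                                  CASCADE     => TRUE);
END;
/
```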
INCREMENTAL is definitely a good idea and should be able to help significantly, although there are some limitations to incremental statistics. For example, you must gather with the default ESTIMATE_PERCENT (AUTO_SAMPLE_SIZE).
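Enabling incremental statistics is a one-time table preference, roughly like this (schema and table names taken from your example):

```sql
BEGIN
    -- Mark the table for incremental statistics; per-partition synopses
    -- will be maintained from the next gather onward.
    DBMS_STATS.SET_TABLE_PREFS('MY_SCHEMA', 'LAR_ALLOCATION_PER_PART',
                               'INCREMENTAL', 'TRUE');
    -- Incremental stats require the default sample size.
    DBMS_STATS.SET_TABLE_PREFS('MY_SCHEMA', 'LAR_ALLOCATION_PER_PART',
                               'ESTIMATE_PERCENT', 'DBMS_STATS.AUTO_SAMPLE_SIZE');
END;
/
```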
The reason why gathering statistics for a single partition is slow is a long story.
It starts with the optimizer needing to know both the number of values and the number of distinct values. The number of distinct values is often more useful. For example, if we query select * from employee where employee_id = 1, Oracle can look at the distinctness of EMPLOYEE_ID, determine that the predicate returns one row, and conclude that an index would be perfect. On the other hand, if we query select * from employee where status = 'terminated', Oracle can look at the distinctness of STATUS, determine that the predicate returns many rows, and conclude that a full table scan is a better match.
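The arithmetic behind that choice is simple. Here is a toy model in Python of the uniform-distribution estimate (real Oracle costing involves much more, such as histograms; the numbers below are illustrative):

```python
def estimated_rows(num_rows, num_distinct):
    """Uniform-distribution cardinality estimate for a `col = value`
    predicate: selectivity is 1/NDV, so expected rows = rows / NDV."""
    return num_rows / num_distinct

# EMPLOYEE_ID: every value distinct -> ~1 row expected, an index is ideal
print(estimated_rows(num_rows=10_000, num_distinct=10_000))  # 1.0

# STATUS: only a few distinct values -> thousands of rows expected,
# so a full table scan is the better access path
print(estimated_rows(num_rows=10_000, num_distinct=3))
```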
Finding the number of distinct values is more complicated than finding the total number of values. A naive algorithm would sort or hash the values, but that requires a lot of time and space. Instead, Oracle can use an algorithm like HyperLogLog to estimate the count in a single pass of the table. This is why you need to keep ESTIMATE_PERCENT at the default value - it's faster to scan the whole table than to sort 10% of the table.
But finding the number of distinct values gets more complicated with partitioning. Oracle needs to know the number of distinct values per partition and for the whole table. Adding a small number of rows to a single partition could significantly change the results for the entire table, which is why Oracle by default has to re-scan the entire table.
For example, think of the birthday problem. Imagine the table is partitioned by group of people, with a BIRTHDAY column. If partition A has 15 distinct birthdays, and partition B has 15 distinct birthdays, how many distinct birthdays does the entire table have? Probably less than 30.
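A quick simulation makes the point (the 365-day calendar and group sizes are just the birthday-problem setup, nothing Oracle-specific):

```python
import random

def union_distinct_birthdays(trials=10_000):
    """Each of two partitions independently holds 15 distinct birthdays
    drawn from 365 days; average the distinct count of their union."""
    total = 0
    for _ in range(trials):
        a = set(random.sample(range(365), 15))
        b = set(random.sample(range(365), 15))
        total += len(a | b)
    return total / trials

# Expected overlap is 15 * 15 / 365 ~ 0.6 birthdays, so the union
# averages about 29.4 distinct values - less than 15 + 15 = 30.
print(union_distinct_birthdays())
```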
Incremental statistics solves that problem by creating a synopsis for each partition. Those synopses can be quickly combined together to re-estimate the global number of distinct values without re-scanning each partition. It only requires storing a little bit of extra data for each partition.
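The mechanics can be sketched with a toy HyperLogLog in Python. Oracle's actual synopsis format is internal and undocumented; this only illustrates why merging per-partition sketches is cheap and still gives an accurate global distinct count:

```python
import hashlib
import math

class Synopsis:
    """Toy HyperLogLog sketch standing in for a per-partition synopsis."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: first p bits pick a register, the rest give a rank
        h = int.from_bytes(
            hashlib.blake2b(str(value).encode(), digest_size=8).digest(), 'big')
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leftmost 1-bit position
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Register-wise max: the cheap combine step that lets the global
        NDV be re-estimated without re-scanning every partition."""
        out = Synopsis(self.p)
        out.registers = [max(a, b)
                         for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            raw = self.m * math.log(self.m / zeros)
        return int(raw)

# Two "partitions" with overlapping values: 0..9999 and 5000..14999
part_a, part_b = Synopsis(), Synopsis()
for v in range(10_000):
    part_a.add(v)
for v in range(5_000, 15_000):
    part_b.add(v)

# Global NDV from the merged synopses - close to the true 15,000,
# without touching the underlying rows again
print(part_a.merge(part_b).estimate())
```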