sql, oracle-database, indexing, relational-database, binary-search

How to use an index when the searched column is not indexed but has the same ordering as the indexed primary key


I have a huge table (kind of audit log) with these columns:

ID, TS, DATA
  • ID is the primary key, and it is a number from a sequence.
  • TS is a timestamp and it is the current timestamp of the insert.
  • DATA is the useful data.

There is an index on the primary key (ID).

There is a guarantee that if record "B" has a greater ID than record "A", then the TS of "B" will be greater than or equal to the TS of "A".

My goal is to select the records for a given time interval. The time interval is always very short compared to the total time the table covers (days vs years).

There is no index on TS and I cannot create one. The reason is that the audit table on the production system has a few thousand million rows, and the DBAs fear that a new index would unnecessarily slow down inserts into the log. (My query will not be a frequent one and will touch only a small part of the table.)

If I simply ask

select * 
from audit 
where TS > 20220101 and TS < 20220102

then a full table scan happens which takes way too much time.

If I first find out the first ID for each day, and learn that the first ID for 20220101 is 123456 and the first ID for 20220102 is 145678, then I can ask

select * 
from audit 
where ID > 123456 and ID < 145678

which is quick because an index scan happens.

So it is obvious that instead of the full table scan I should somehow find out the first and last ID for the given time period and use them. It is also obvious that I can find the IDs quickly via binary search, because of the correlation between the IDs and the TSs. But I don't know how to do this in a SQL query, if it is possible at all.

So is it possible to make use of the ID index for this query? If yes, how?

Is it possible to somehow hint to the DB engine that there is a correlation between ID and TS?


Solution

  • Below are some options to solve your problem:

    1. Build the index. The DBAs' fear of indexing is unfounded. Indexes exist precisely for this kind of problem. Indexes are built to grow, and in some ways they work better for large tables than they do for small tables. There's nothing unusual about multiple indexes on a billion-row table. The table already has one index for the primary key - how much would a second index hurt?
    2. Schedule the query. If a query isn't run often enough to justify an index, then it usually also doesn't matter if the query takes a long time to finish. If the query is only used once a week, can't you just wait an hour? If this query is part of some review you have to perform every day, and you don't want to wait an hour every morning, then automate the query with some combination of DBMS_SCHEDULER, a PL/SQL block to turn the results into an HTML table, and sending the results with UTL_MAIL.
    3. Parallelism. If you can't work smarter, work harder. Use multiple threads to make the full table scan run more quickly. Just change your query to SELECT /*+ PARALLEL(X) */ ... where X is some reasonable number of threads. Be careful not to use too much parallelism. If your DBAs are scared of an index, they probably hate parallelism.
    4. Partition. Partitioning creates multiple physical objects that Oracle can almost instantly switch between. This might work perfectly in your scenario, since your queries appear to be relatively coarse, and Oracle has a convenient interval partitioning approach that stores one day's worth of data in one physical partition.
    alter table audit_table
    modify partition by range(ts) interval(numtodsinterval(1, 'DAY'))
    (
        --Pick the earliest day here:
        partition p1 values less than (date '2020-01-01')
    );
    
    5. Materialized zone map. (If you have an Exadata system. I haven't tested this code or concept.) A materialized zone map on the TS column might work well as a pseudo index. A zone map stores the min and max values for a group of blocks. Since your values are loaded in order, your table data might be ordered enough for those mins and maxes to help quickly identify the relevant blocks.
    create materialized zonemap audit_tabl_zm on audit_table(ts);
    
    6. Binary search of the related index. Use one of the existing answers, possibly as part of a function or CTE, or take the IDs returned and manually plug them into an existing query. While this works, it's not ideal because you shouldn't have to significantly change your queries to get good performance.

      Also, are you sure that the primary key and the timestamp are perfectly synchronized? Relying on sequence values to carry any meaning, even an ordering, can be dangerous. For example, if you have Real Application Clusters and the sequence was not explicitly set to ORDER, each instance will have a separate cache and the IDs will not necessarily be in order.
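
      The binary-search option can be sketched roughly as follows. This is a minimal illustration in Python, using an in-memory SQLite table as a stand-in for the Oracle table so that it runs as-is. The table name `audit`, the helper names `first_id_at_or_after` and `rows_in_interval`, and the sample rows are all assumptions made for the example. It relies on the guarantee that TS never decreases as ID grows, tolerates gaps in the sequence, and treats the interval as half-open (start inclusive, end exclusive).

```python
import sqlite3


def first_id_at_or_after(conn, ts):
    """Return the smallest existing ID whose TS >= ts, or None if there is none.

    Assumes TS is non-decreasing in ID. Every statement here is answered
    from the primary-key index, so no full table scan is involved.
    """
    lo, hi = conn.execute("SELECT MIN(id), MAX(id) FROM audit").fetchone()
    if lo is None:
        return None  # empty table
    (last_ts,) = conn.execute("SELECT ts FROM audit WHERE id = ?", (hi,)).fetchone()
    if last_ts < ts:
        return None  # even the newest row is older than ts

    # Invariant: lo and hi are existing IDs, TS(hi) >= ts, answer is in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        # The sequence may have gaps, so snap down to the nearest existing ID.
        (probe,) = conn.execute(
            "SELECT MAX(id) FROM audit WHERE id <= ?", (mid,)
        ).fetchone()
        (probe_ts,) = conn.execute(
            "SELECT ts FROM audit WHERE id = ?", (probe,)
        ).fetchone()
        if probe_ts >= ts:
            hi = probe  # the answer is probe or something before it
        else:
            # No ID up to mid qualifies; continue from the next existing ID.
            (lo,) = conn.execute(
                "SELECT MIN(id) FROM audit WHERE id > ?", (mid,)
            ).fetchone()
    return lo


def rows_in_interval(conn, start_ts, end_ts):
    """Return rows with start_ts <= TS < end_ts using only ID range predicates."""
    first = first_id_at_or_after(conn, start_ts)
    if first is None:
        return []
    end = first_id_at_or_after(conn, end_ts)
    if end is None:
        # No row at or after end_ts: take everything from first onwards.
        sql, args = "SELECT id, ts, data FROM audit WHERE id >= ? ORDER BY id", (first,)
    else:
        sql = "SELECT id, ts, data FROM audit WHERE id >= ? AND id < ? ORDER BY id"
        args = (first, end)
    return conn.execute(sql, args).fetchall()


# Hypothetical sample data: IDs have gaps, TS is stored as sortable ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit (id INTEGER PRIMARY KEY, ts TEXT NOT NULL, data TEXT)")
conn.executemany(
    "INSERT INTO audit VALUES (?, ?, ?)",
    [
        (10, "2021-12-31 23:50:00", "a"),
        (11, "2021-12-31 23:59:59", "b"),
        (15, "2022-01-01 00:00:00", "c"),
        (16, "2022-01-01 08:30:00", "d"),
        (20, "2022-01-01 08:30:00", "e"),
        (21, "2022-01-02 00:00:00", "f"),
        (30, "2022-01-03 12:00:00", "g"),
    ],
)

result = rows_in_interval(conn, "2022-01-01 00:00:00", "2022-01-02 00:00:00")
print([row[0] for row in result])  # prints [15, 16, 20]
```

      Each probe is a primary-key lookup, so the number of lookups grows with the logarithm of the ID range (around 30 iterations for a few billion IDs) rather than with the size of the table. On Oracle, the same probing queries could presumably be wrapped in a PL/SQL function; that variant is not shown or tested here.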