
Filtering rows in WHERE clause with a textual set of ranges


I have a table with tens of millions of rows. Various complex filtering queries produce row sets to support an application. These row sets are of arbitrary size, from a single row up to and including the full table. For domain-specific reasons, however, they always maintain high levels of contiguity along a particular key.

I need to pass these row sets bidirectionally between the database and the application, and it would be nice to compress them somehow. Many of you are probably familiar with UNIX cut, which takes a field specification like cut -f 2-6,7,9-21 and returns the corresponding columns. I am currently using a slightly limited version of the cut field specification (e.g. no open-ended ranges like 17-) to represent the row sets. So, for instance, 24-923817,2827711-8471362,99188271 indicates a unique set of 6567447 rows while occupying only 34 bytes.

I have already written the following function to convert these specifications to SQL WHERE filters using the BETWEEN syntax:

-- $1: the row set specification, e.g. '24-923817,99188271'
-- $2: the key column name to splice into the generated filter
CREATE OR REPLACE FUNCTION cut_string_to_sql_filter( TEXT, TEXT ) RETURNS TEXT AS $$
SELECT
    CASE $1
        -- An empty specification matches no rows.
        WHEN '' THEN 'FALSE'
        ELSE
            (SELECT
                '(' || STRING_AGG(
                           -- 'a-b' becomes 'key BETWEEN a AND b';
                           -- a lone 'a' becomes 'key=a'.
                           REGEXP_REPLACE(
                               REGEXP_REPLACE( str, '(\d+)-(\d+)', QUOTE_IDENT( $2 ) || ' BETWEEN \1 AND \2' ),
                               '^(\d+)$', QUOTE_IDENT( $2 ) || '=\1' ),
                           ' OR ' ) || ')' AS sql
             FROM
                 REGEXP_SPLIT_TO_TABLE( $1, ',' ) AS t(str))
    END;
$$ LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE;

The first parameter is the row set specification and the second parameter is the key field name for the table. For the example above, SELECT cut_string_to_sql_filter( '24-923817,2827711-8471362,99188271', 'some_key' ) returns:

(some_key BETWEEN 24 AND 923817 OR some_key BETWEEN 2827711 AND 8471362 OR some_key=99188271)

The problem with this is that, currently, any query that makes use of such a row set specification must be assembled with dynamic SQL, because I cannot think of a way to use custom operators or any other syntactic feature to embed this effect in a plain SQL query.
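To make the constraint concrete: because cut_string_to_sql_filter returns the filter as text, every consumer ends up wrapped in EXECUTE. A minimal sketch of such a wrapper (fetch_rows, the table test, and the key some_key are hypothetical names):

CREATE OR REPLACE FUNCTION fetch_rows( _spec TEXT ) RETURNS SETOF test AS $$
BEGIN
    -- The generated filter is plain text, so it can only be spliced
    -- into a query dynamically.
    RETURN QUERY EXECUTE
        'SELECT * FROM test WHERE ' || cut_string_to_sql_filter( _spec, 'some_key' );
END
$$ LANGUAGE PLPGSQL;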

I have also written a set-returning function for the row specifications:

CREATE OR REPLACE FUNCTION cut_string_to_set( TEXT ) RETURNS SETOF INTEGER AS $$
DECLARE
    _i TEXT;
    _pos INTEGER;
    _start INTEGER;
    _end INTEGER;
BEGIN
    IF $1 <> '' THEN
        FOR _i IN SELECT REGEXP_SPLIT_TO_TABLE( $1, ',' ) LOOP
            _pos := POSITION( '-' IN _i );
            IF _pos > 0 THEN
                -- 'a-b': emit every key from a through b inclusive.
                _start := SUBSTRING( _i FROM 1 FOR _pos - 1 )::INTEGER;
                _end := SUBSTRING( _i FROM _pos + 1 )::INTEGER;
                -- _j is implicitly declared INTEGER by the integer FOR loop.
                FOR _j IN _start.._end LOOP
                    RETURN NEXT _j;
                END LOOP;
            ELSE
                -- A lone 'a': emit the single key.
                RETURN NEXT _i::INTEGER;
            END IF;
        END LOOP;
    END IF;
END
$$ LANGUAGE PLPGSQL IMMUTABLE STRICT PARALLEL SAFE;

This works in plain SQL with WHERE some_key IN (SELECT cut_string_to_set(...)). It is, of course, comparatively inefficient at unpacking what is best expressed to the planner as a set of ranges, produces nightmarish and verbose query plans, and may or may not prevent the planner from using an index when it otherwise could and should.
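As an aside, the same unpacking can be expressed as a single set-based SQL function, avoiding the per-element PL/pgSQL loops entirely. A sketch, where cut_string_to_set_sql is a hypothetical name (it shares the planner drawbacks above; it is only simpler):

CREATE OR REPLACE FUNCTION cut_string_to_set_sql( TEXT ) RETURNS SETOF INTEGER AS $$
    -- Split on commas, then expand each piece with GENERATE_SERIES;
    -- a piece without '-' expands to a single value.
    SELECT GENERATE_SERIES(
               SPLIT_PART( t.str, '-', 1 )::INTEGER,
               COALESCE( NULLIF( SPLIT_PART( t.str, '-', 2 ), '' ),
                         SPLIT_PART( t.str, '-', 1 ) )::INTEGER )
    FROM REGEXP_SPLIT_TO_TABLE( $1, ',' ) AS t(str)
    WHERE t.str <> '';
$$ LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE;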

Can anybody offer a solution to this conundrum: packaging this, potentially as its own type, potentially with custom operators, to allow syntactically sane, index-based filtering on a column without dynamic SQL in the broader query? Is this simply impossible?

Feel free to offer suggestions for improving the functions if you see any opportunities. And thanks!

EDIT 1

A great answer below suggests using an array of range types. Unfortunately, the query planner does not seem willing to use indexes with such a query. Planner output below, from a run on a small test table.

Gather  (cost=1000.00..34587.33 rows=38326 width=45) (actual time=0.395..112.334 rows=1018 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  ->  Parallel Seq Scan on test  (cost=0.00..29754.73 rows=6388 width=45) (actual time=91.525..107.354 rows=145 loops=7)
        Filter: (test_ref <@ ANY ('{"[24,28)","[29,51)","[999,1991)"}'::int4range[]))
        Rows Removed by Filter: 366695
Planning time: 0.214 ms
Execution time: 116.779 ms

The CPU cost (notice 6 workers in parallel for over 100 ms on the small test table) is too high. I cannot see how any additional index could help here.

To contrast, here is the planner output using the BETWEEN filters.

Bitmap Heap Scan on test  (cost=22.37..1860.39 rows=1031 width=45) (actual time=0.134..0.430 rows=1018 loops=1)
  Recheck Cond: (((test_ref >= 24) AND (test_ref <= 27)) OR ((test_ref >= 29) AND (test_ref <= 50)) OR ((test_ref >= 999) AND (test_ref <= 1990)))
  Heap Blocks: exact=10
  ->  BitmapOr  (cost=22.37..22.37 rows=1031 width=0) (actual time=0.126..0.126 rows=0 loops=1)
        ->  Bitmap Index Scan on test_test_ref_index  (cost=0.00..2.46 rows=3 width=0) (actual time=0.010..0.010 rows=4 loops=1)
              Index Cond: ((test_ref >= 24) AND (test_ref <= 27))
        ->  Bitmap Index Scan on test_test_ref_index  (cost=0.00..2.64 rows=21 width=0) (actual time=0.004..0.004 rows=22 loops=1)
              Index Cond: ((test_ref >= 29) AND (test_ref <= 50))
        ->  Bitmap Index Scan on test_test_ref_index  (cost=0.00..16.50 rows=1007 width=0) (actual time=0.111..0.111 rows=992 loops=1)
              Index Cond: ((test_ref >= 999) AND (test_ref <= 1990))
Planning time: 0.389 ms
Execution time: 0.660 ms

END EDIT 1

EDIT 2

The answer below suggests using a range index. The problem, as far as I understand it, is that I do not have a range type to index: the key column is a plain integer. All right, so maybe if the key column is converted to a range for the operation, I can apply a GIST index to that and the planner will use it.

CREATE INDEX test_test_ref_gist_index ON test USING GIST (test_ref);
ERROR:  data type integer has no default operator class for access method "gist"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

No surprise here. So let's convert the key column to a range and index that.

-- Note the '[]' bounds: INT4RANGE( test_ref, test_ref ) with the default '[)'
-- bounds would canonicalize every entry to the empty range.
CREATE INDEX test_test_ref_gist_index ON test USING GIST (INT4RANGE( test_ref, test_ref, '[]' ));

Whew, a 110 MB index. That's hefty. But does it work?

Gather  (cost=1000.00..34587.33 rows=38326 width=45) (actual time=0.419..111.009 rows=1018 loops=1)
  Workers Planned: 6
  Workers Launched: 6
  ->  Parallel Seq Scan on test_mv  (cost=0.00..29754.73 rows=6388 width=45) (actual time=90.229..105.866 rows=145 loops=7)
        Filter: (test_ref <@ ANY ('{"[24,28)","[29,51)","[999,1991)"}'::int4range[]))
        Rows Removed by Filter: 366695
Planning time: 0.237 ms
Execution time: 114.795 ms

Nope. I'm not too surprised. I would expect this index to work for "contains" rather than "contained by" operations, though I have no experience here. (The filter as written also still references the bare test_ref column rather than the indexed INT4RANGE expression, so an expression index cannot match it anyway.)

END EDIT 2


Solution

  • Pass an array of ranges:

    select *
    from t
    where
        k <@ any (array[
            '[24,923817]','[2827711,8471362]','[99188271,99188271]'
        ]::int4range[])
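
    To build that int4range[] directly from the question's cut-style specification, a helper along these lines could be used (a sketch; cut_string_to_ranges is a hypothetical name, not a built-in):

    create or replace function cut_string_to_ranges( text ) returns int4range[] as $$
        -- '24-923817' becomes '[24,923817]'; a lone '99188271' becomes
        -- the single-point range '[99188271,99188271]'.
        select array_agg(
                   int4range( split_part(t.str, '-', 1)::integer,
                              coalesce( nullif( split_part(t.str, '-', 2), '' ),
                                        split_part(t.str, '-', 1) )::integer,
                              '[]' ) )
        from regexp_split_to_table($1, ',') as t(str)
        where t.str <> ''
    $$ language sql immutable strict parallel safe;

    so the filter can be written as k <@ any (cut_string_to_ranges('24-923817,2827711-8471362,99188271')).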
    

    Check indexing for range types: https://www.postgresql.org/docs/current/static/rangetypes.html#RANGETYPES-INDEXING
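
    For reference, when the indexed column is itself a range type, the pattern from those docs is a plain GiST index (table and column names as in the docs' reservation example):

    create index reservation_idx on reservation using gist (during);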

    In case a suitable range index is not possible, do a join to the materialized ranges:

    select *
    from
        t
        inner join
        (
            select generate_series(lower(a),upper(a) - 1) as k  -- upper() is exclusive for canonical int4ranges, hence - 1
            from unnest(array[ 
                '[24,27]','[29,50]','[999,1990]'
            ]::int4range[]) a(a)
        ) s using (k)
    

    It is possible to avoid joining all the range values. Compare to the lower and upper bounds of the range:

    select *
    from
        t
        cross join
        (
            select lower(a) as l, upper(a) - 1 as u
            from unnest(array[
                '[24,27]','[29,50]','[999,1990]'
            ]::int4range[]) a(a)
        ) s
    where k between l and u
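
    The ranges in the question's specification never overlap, so the cross join cannot produce duplicate rows. If overlapping ranges were ever possible, an exists form would be a safe variant (a sketch):

    select *
    from t
    where exists (
        select 1
        from unnest(array[
            '[24,27]','[29,50]','[999,1990]'
        ]::int4range[]) a(a)
        where t.k between lower(a) and upper(a) - 1
    )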