Search code examples
postgresqlcountdatabase-performancepostgresql-performance

Optimization of count query for PostgreSQL


I have a table in postgresql that contains an array which is updated constantly.

In my application i need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:

select count(id) 
from table 
where not (ARRAY['parameter value'] <@ table.array_column)

But when increasing the amount of rows and the amount of executions of that query (several times per second, possibly hundreds or thousands) the performance decreses a lot, it seems to me that the counting in postgresql might have a linear order of execution (I’m not completely sure of this).

Basically my question is:

Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?

Any suggestion you could give me would be really appreciated.


Solution

  • Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?

    Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.

    CREATE TABLE demo( id integer primary key );
    INSERT INTO demo (id) SELECT id FROM arrtable;
    CREATE TABLE properties (
      demo_id integer not null references demo(id),
      property integer not null,
      primary key (demo_id, property)
    );
    CREATE INDEX properties_property_idx ON properties(property);
    

    You can then query the properties:

    SELECT count(id) 
    FROM demo 
    WHERE NOT EXISTS (
      SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
    )
    

    I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.

    Again, a seq scan on the array-containing table does the job just as well.