Tags: database, postgresql, indexing, database-partitioning

Best practice for massive Postgres tables


I have a table with three fields (username, target_value, score) generated externally by a full cross join of usernames (~400,000) and target_values (~4,000) plus a calculated score, leading to a total row count of ~1.6 billion.

All my queries on this table will be of the form

SELECT *
FROM _table
WHERE target_value IN (123, 456)

My initial version of this included a B-tree index on target_value, but I ended up spending 45 minutes on a Bitmap Heap Scan of the index. I've also been looking at BRIN indexes, partitioning and table clustering, but as it takes hours to apply each approach to the table, I can't exactly brute-force each option and test for performance.
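For reference, the index and the plan check look roughly like this (names are illustrative):

CREATE INDEX target_value_idx ON _table (target_value);

-- inspect the plan and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM _table
WHERE target_value IN (123, 456);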

What are some recommendations for dealing with a single massive table with very 'blocky' data in Postgres 10?


Solution

  • If the table is a cross join of two data sets, why don't you store the individual tables and calculate the join as you need it? Databases are good at that.
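    A minimal sketch of that idea, assuming the score can be recomputed at query time (compute_score is a placeholder for whatever produces it):

    CREATE TABLE usernames (username text PRIMARY KEY);
    CREATE TABLE target_values (target_value integer PRIMARY KEY);

    -- compute the cross join and the score only for the target values you need;
    -- compute_score is a hypothetical function standing in for your calculation
    SELECT u.username, t.target_value,
           compute_score(u.username, t.target_value) AS score
    FROM usernames u
    CROSS JOIN target_values t
    WHERE t.target_value IN (123, 456);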

    From your description I would expect a performance gain if you ran CLUSTER on the table to physically rewrite it in index order. Then you would have to access way fewer table blocks.

    Unfortunately CLUSTER takes a long time, makes the table unavailable while it runs, and has to be repeated regularly to keep the physical order.
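    If you try it, it is a single statement (using the B-tree index on target_value you already have; the index name here is made up):

    CLUSTER _table USING target_value_idx;
    ANALYZE _table;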

    An alternative that may be better is to partition the table by target_value. 4000 partitions is a bit much, so you could use list partitioning to bundle several target values into each partition.

    This will allow your queries to perform fast sequential scans on only a few partitions. It will also make autovacuum's job easier.
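    A sketch of what that could look like in Postgres 10 (table and partition names are made up, and the value bundles are arbitrary):

    CREATE TABLE scores (
        username     text,
        target_value integer,
        score        real
    ) PARTITION BY LIST (target_value);

    -- one partition per bundle of target values (small bundles shown here for brevity)
    CREATE TABLE scores_p01 PARTITION OF scores
        FOR VALUES IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    CREATE TABLE scores_p02 PARTITION OF scores
        FOR VALUES IN (11, 12, 13, 14, 15, 16, 17, 18, 19, 20);
    -- ... and so on for the remaining target values

    -- load the existing data; Postgres routes each row to the right partition
    INSERT INTO scores SELECT * FROM _table;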

    The bottom line, however, is that if you select a lot of rows from a table, it will always take a long time.