Search code examples
sqlpostgresqlsql-execution-planexplain

PostgreSQL EXPLAIN: How do I see a plan AS IF certain tables had millions of rows?


This is a question about PostgreSQL's EXPLAIN command. This command shows you how the optimizer will execute your SQL based on the data in your tables. We are not in prod yet, so all of our tables have ~100 rows or less. Is there a way to get EXPLAIN to tell me what the explain plan would look like if certain tables had millions of rows instead of tens of rows?

I could generate the data somehow, but then I'd have to clear it and wait for it to be created. If that's the only way, I'll accept that as an answer, though.


Solution

  • I don't think so. Postgresql collects some statistics related to the table that the optimizer will use to choose the best plan. These statistics are not related to simply how many rows a table contains but they will depends on the values/data too.

    From the postgres documentation:

    the query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans.

    What does it mean that? Suppose we have an indexed column called foo, without a non-unique constraint. Suppose you have the following simple query:

    SELECT * FROM test_table WHERE foo = 5
    

    Postgresql will have to choose between different index scans:

    • sequential scan
    • index scan
    • bitmap scan

    It will choose the type of scan based on how many rows it thinks to retrieve from the query. How does it know how many rows will be retrieved before running the query? With the statistics that it collects. These statistics are based on the VALUES/DATA inside your table. Suppose you have a table with 1 million of rows and 90% of them have foo = 5. Postgresql may be know that, because it could have collected some statistics about the distribution of your data. So it will chose a sequential scan, because according to its cost model, this scan is the cheapest one.

    In the end, it will would be not enough generate data, but you should generate values that will represent the reality (the data that you will have in the future).

    You can already build your database creating some indexes (based on the query that you will do) to have already good performance in production. If it will be not enough you will have to tune your indexes after you go into production.