Search code examples
postgresqldatabase-performance

Postgres queries intermittently running extremely slowly


We have some queries that are running extremely slowly intermittently in our production environment. These are JSONB intersection queries which normally return in milliseconds, but are taking 30-90 seconds.

We have tried to look at co-occurring server conditions such as RAM, CPU and query load, but there is nothing obvious. This affects a very small minority of queries - probably less than 1%. This does not appear to be a query optimization issue as the affected queries themselves are varied and in some cases very simple.

We've reproduced the same environment as far as possible on a staging server and loaded it heavily and the issue does not occur.

Can anyone suggest possible steps to investigate what is occurring in Postgres when this happens, or anything else we should consider? We have been working on this for over a week and are running out of ideas.


Solution

  • It is difficult to guess the cause of that problem; one explanation would be locks.

    You should use auto_explain to investigate the problem.

    In postgresql.conf, use the following settings:

    # log if somebody has to wait for a lock for more than one second
    log_lock_waits = on
    # log slow statements with their parameters
    log_min_duration_statement = 1000
    # log the plans of slow statements
    shared_preload_libraries = 'auto_explain'
    # configuration for auto_explain
    auto_explain.log_nested_statements = on
    auto_explain.log_min_duration = 1000
    

    Then restart PostgreSQL.

    Now all statements that exceed one second will have their plan dumped in the PostgreSQL log, so all you have to do is to wait for the problem to happen again, so that you can analyze it.

    You can also get EXPLAIN (ANALYZE, BUFFERS) output if you set

    auto_explain.log_buffers = on
    auto_explain.log_analyze = on
    

    That would make the log much more valuable, but it will slow down processing considerably, so I'd be reluctant to do it on a production system.