Tags: postgresql, gitlab, gitlab-ce

GitLab encountering issues with PostgreSQL after OS upgrade to RHEL 7.6


We recently upgraded the OS:

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

After upgrading, we are facing a lot of issues with GitLab, predominantly with Postgres.

Our GitLab is dockerized, i.e. GitLab (and all its internal services, including PostgreSQL) runs inside a single container. The container does not have its own glibc, so it uses the one from the OS.

ERROR: canceling statement due to statement timeout

STATEMENT:
SELECT relnamespace::regnamespace as schemaname, relname as relname, pg_total_relation_size(oid) bytes FROM pg_class WHERE relkind = 'r';

The timeout messages appear continuously and this results in users facing 502 errors when accessing GitLab.

I checked the statement timeout set on the database.

gitlabhq_production=# show statement_timeout;
 statement_timeout
-------------------
 1min
(1 row)

I don't know what to make of this; it is probably the default setting. Is this an issue with Postgres? What does it mean, and is there anything I can do to fix it?
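
For reference, this is roughly how the origin of the setting can be checked (my assumption is that GitLab's bundled configuration sets the one-minute timeout, since PostgreSQL's own default for statement_timeout is 0, i.e. disabled):

-- Where does the current statement_timeout value come from?
-- (source = 'default', 'configuration file', 'database', 'user', ...)
SELECT name, setting, unit, source, sourcefile
FROM pg_settings
WHERE name = 'statement_timeout';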

EDIT:

Checked pg_stat_activity and don't see any locks as the server was rebooted earlier. The same query is running fine now but we keep seeing this issue intermittently.
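
This is roughly the check I ran (assuming PostgreSQL 9.6 or later, which is what exposes the wait_event_type/wait_event columns):

-- Non-idle backends, how long their current query has been running,
-- and whether they are waiting on anything
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;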

Ran \d pg_class to check whether the table has any indexes and to look at its string columns.

gitlabhq_production=# \d pg_class
         Table "pg_catalog.pg_class"
       Column        |   Type    | Modifiers
---------------------+-----------+-----------
 relname             | name      | not null
 relnamespace        | oid       | not null
 reltype             | oid       | not null
 reloftype           | oid       | not null
 relowner            | oid       | not null
 relam               | oid       | not null
 relfilenode         | oid       | not null
 reltablespace       | oid       | not null
 relpages            | integer   | not null
 reltuples           | real      | not null
 relallvisible       | integer   | not null
 reltoastrelid       | oid       | not null
 relhasindex         | boolean   | not null
 relisshared         | boolean   | not null
 relpersistence      | "char"    | not null
 relkind             | "char"    | not null
 relnatts            | smallint  | not null
 relchecks           | smallint  | not null
 relhasoids          | boolean   | not null
 relhaspkey          | boolean   | not null
 relhasrules         | boolean   | not null
 relhastriggers      | boolean   | not null
 relhassubclass      | boolean   | not null
 relrowsecurity      | boolean   | not null
 relforcerowsecurity | boolean   | not null
 relispopulated      | boolean   | not null
 relreplident        | "char"    | not null
 relfrozenxid        | xid       | not null
 relminmxid          | xid       | not null
 relacl              | aclitem[] |
 reloptions          | text[]    |
Indexes:
    "pg_class_oid_index" UNIQUE, btree (oid)
    "pg_class_relname_nsp_index" UNIQUE, btree (relname, relnamespace)
    "pg_class_tblspc_relfilenode_index" btree (reltablespace, relfilenode)

Would reindexing all tables, and possibly altering them, help?


Solution

  • You should check whether the query really runs for a whole minute or whether it is blocked behind a database lock. You can tell from the pg_stat_activity row for the backend: state = 'active' together with wait_event_type = 'Lock' means the query is waiting for a lock (a sketch of the relevant queries follows at the end of this answer).

    If it is a lock, get rid of the locking transaction. It may be a prepared transaction, so check for these too.

    If there is no lock at fault, it could be that your indexes have become corrupted by the operating system upgrade:

    PostgreSQL uses the operating system's collations, so database indexes on string columns are sorted in collation order. An operating system upgrade can (and often does) change those collations because of bug fixes in the C library, so you should rebuild all indexes on string columns after such an upgrade.

    The statement that you are showing does not use an index scan, so it should not be affected, but other statements may be.

    Also, if you are using Docker, it may be that your container uses its own glibc that was not upgraded, and then you are not affected.
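
    As a rough sketch of the checks above (assuming PostgreSQL 9.6 or later for pg_blocking_pids() and the wait_event columns, and using the database name gitlabhq_production from the question):

    -- 1. Backends waiting on a lock, together with the sessions blocking them
    SELECT pid, pg_blocking_pids(pid) AS blocked_by,
           wait_event_type, wait_event, left(query, 60) AS query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;

    -- 2. Orphaned prepared transactions; they hold their locks until they are
    --    committed or rolled back explicitly
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
    -- ROLLBACK PREPARED 'gid-from-the-result';  -- only after confirming it is stale

    -- 3. Indexes that include text/varchar columns and therefore depend on the
    --    operating system collation; these are the candidates for rebuilding
    SELECT DISTINCT i.indexrelid::regclass AS index_name
    FROM pg_index i
    JOIN pg_attribute a ON a.attrelid = i.indrelid
                       AND a.attnum = ANY (i.indkey)
    WHERE a.atttypid IN ('text'::regtype, 'varchar'::regtype);

    -- Rebuild a single index, or simply rebuild everything:
    -- REINDEX INDEX <index_name>;
    -- REINDEX DATABASE gitlabhq_production;

    Note that REINDEX blocks writes to the affected tables while it runs, so it is best done in a maintenance window.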