Search code examples
postgresqlrelational-database

Index creation speed improvement


I have the following table in Postgres 14.7

create table product
(
    id          uuid,
    description varchar,
    created_at  timestamptz,
    primary key (id)
);

create index idx__product__created_at on product(created_at);

This table has hundreds of millions of records, and many more columns that were ommited for simplification. Now I want to make the following changes:

alter table product
    add column published_to_kafka timestamptz default null;

create index idx__product_published_to_kafka on product (created_at asc) where published_to_kafka is null;

I cannot afford performance issues on the database, not even for a short period of time, and I am worried that the index creation will make the database slow.

If most of the records on the database had the column published_to_kafka not null when the command for the index creation was executed, would that make the performance of the index creation faster?


Solution

  • I cannot afford performance issues on the database, not even for a short period of time, and I am worried that the index creation will make the database slow.

    I think you need to be clear about the difference between a "slow database" and a "locked table" -- a regular CREATE INDEX will lock the table until the index is built, temporarily preventing users from getting some data. While this can be experienced by users as a slow web page loading, it is really a short outage because data is inaccessible. If you are looking to prevent this kind of service disruption, you'll have to use CREATE INDEX CONCURRENTLY, which works around the locking behavior.

    If you are genuinely concerned about the performance of index builds, I think it is important to bear in mind that to create an index, every row must be visited, and the relevant column(s) need to be analyzed to create the index. We can't get around having to scan every row. But then again, without an index, you will likely have to scan every row during a SELECT (unless, of course, you are searching on a column with a unique constraint, and the first few rows returns what you're looking for).

    As the index is built, the contents of the index will live in memory (because things need to be sorted) until it is ready to be persisted to disk. Therefore, if maintenance_work_mem is not large enough, some sorting operations will be done on-disk instead of in-memory, and thereby would slow down the CREATE INDEX operation.

    As others have mentioned in the comments, if you're concerned about not being able to "afford performance issues on the database, not even for a short period of time," you need to re-evaluate whether your current hardware is sufficient for your needs. If SELECT * FROM <table_to_be_indexed> slows down your database, or if VACUUM <table_to_be_indexed> slows down your database, you may have hardware ill-suited for your needs.

    If most of the records on the database had the column published_to_kafka not null when the command for the index creation was executed, would that make the performance of the index creation faster?

    Possibly -- you still have to hit every row, but sorting might be faster, since there are less values to sort.