Tags: postgresql, amazon-web-services, amazon-rds, database-performance

What is the bottleneck of my postgres bulk insert?


I am bulk-inserting some data into a postgres DB on RDS, and it's taking longer than I would like.

A migration to initialise the DB schema would look like this:

CREATE TABLE "addresses" ("address" TEXT NOT NULL);

CREATE UNIQUE INDEX "unique_address" ON "addresses"("address");

CREATE INDEX "autocomplete_index" ON "addresses" USING btree (lower(address) text_pattern_ops);

The data is coming from S3 where I have a collection of around 800 256MB CSV files. For each CSV file, I use the aws_s3.table_import_from_s3 function to copy the data into a temporary table. This part is very fast. Then I insert from the temporary table into my addresses table like this:
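
A single import call looks roughly like this (the bucket name, object key, and region here are placeholders, not the real values):

SELECT aws_s3.table_import_from_s3(
    'temp_addresses',               -- target (temporary) table
    'address',                      -- column list
    '(FORMAT csv)',                 -- COPY options
    aws_commons.create_s3_uri(
        'my-bucket',                -- placeholder bucket name
        'addresses/part-0001.csv',  -- placeholder object key
        'eu-west-1'                 -- placeholder region
    )
);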

INSERT INTO addresses
SELECT * FROM temp_addresses
ON CONFLICT (address) DO NOTHING;

This INSERT takes about 90 minutes to import a single 256MB CSV file.

From the Performance Insights page it seems like the bottleneck is IO. (This is what I infer from the bars here being dominated by "IO:DataFileRead".)

[Chart: Average Active Sessions]
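
For completeness, the same wait events can be sampled from inside Postgres while the load is running:

SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;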

The DB instance is a db.t3.small with 2 vCPUs and 2 GB RAM, and 1024 GB of gp3 storage with 12,000 provisioned IOPS and 500 MiB/s throughput.

From what I can tell, I am far below the limit in terms of IO throughput:

[Chart: IO throughput]

...and I also seem to be well below the limit in terms of IOPS:

[Chart: IOPS]

...so I'm struggling to understand what the bottleneck is here. What am I missing?


Extra notes:

Here is a chart of the CPU usage during the load:

[Chart: CPU Usage]

And here's one of Freeable memory during the load:

[Chart: Freeable Memory]


Solution

  • Your bottleneck is reading index pages in order to update them for the new data. Each of those reads asks for only 8 kB and is (presumably) randomly scattered. That means you can't max out the throughput, since doing that requires reads to be either large or sequential. You also can't max out IOPS, because doing that requires multiple IO requests to be in flight at the same time, and a single PostgreSQL process does not use async IO/prefetching when doing index maintenance.

    AWS vaguely describes gp3 latency as "single-digit millisecond". If we take that to mean 1 millisecond, then each in-flight request can complete at most about 1,000 operations per second, so you would need at least 12 requests in flight at the same time to approach the limit of 12,000 IOPS.

    You could increase your RAM so that more of the index stays cached and doesn't need to hit disk, but it doesn't seem plausible to increase it enough to handle 200 GB of data. Also, almost all of the pages actually read from disk will get dirtied and will eventually have to be written back. The kernel might do a good job of absorbing those writes internally and issuing them to the underlying storage device asynchronously, but I wouldn't count on that working perfectly.
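
    One way to see how much of the index IO is actually hitting disk (a sketch against the standard statistics views) is to check the per-index block counters:

    SELECT indexrelname, idx_blks_read, idx_blks_hit
    FROM pg_statio_user_indexes
    WHERE relname = 'addresses';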

    You could try to reduce the latency of IO requests against gp3, but I have no idea how you would go about doing that; there don't seem to be any provisioning knobs that adjust it.

    You could try to insert many files simultaneously by launching several workers at a time, each processing a different file. This would be one way to get more IO requests in flight at once. However, it would also introduce a different bottleneck, as the workers would now contend with each other for locks on the shared index pages.

    You could drop indexes that are not needed (any non-unique ones) and re-add them only once all the loads are done, as sketched below. Or you could sort the data before/while loading, so that the data is inserted in index order. That way index maintenance would be directed to the same index leaf pages over and over again, and it would find the hot pages already in cache most of the time and so not need to read them from disk. You might need to combine these, dropping some indexes and ordering by the non-dropped ones.
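
    Dropping and later rebuilding the non-unique index (using the names from your migration) would look like:

    DROP INDEX autocomplete_index;
    -- ...run all of the bulk loads...
    CREATE INDEX autocomplete_index ON addresses USING btree (lower(address) text_pattern_ops);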

    Based on your recent edit, that would look like:

    INSERT INTO addresses
    SELECT * FROM temp_addresses ORDER BY address
    ON CONFLICT (address) DO NOTHING;
    

    Depending on the pattern of capitalization within your data, ordering on "address" might provide a good-enough ordering for lower("address") that the cacheability benefit carries over to that index as well.
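
    If the capitalization turns out to be too mixed for that, a variation (just a sketch, not something I've benchmarked) is to sort on the indexed expression itself, trading locality on the unique index for locality on the functional one:

    INSERT INTO addresses
    SELECT * FROM temp_addresses ORDER BY lower(address)
    ON CONFLICT (address) DO NOTHING;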