I have to read a CSV file every 20 seconds. Each file contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but before that I need to check whether the items have already been inserted, because there is a high probability of duplicates. The field checked for uniqueness is indexed.
So I read the file in chunks and use an IN clause to find the items that are already in the database.
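For illustration, the per-chunk check currently looks roughly like this (tbl and tbl_id are placeholders for the real table and indexed column, and the literal ids stand in for one chunk of the CSV):
SELECT tbl_id
FROM tbl
WHERE tbl_id IN ('id1', 'id2', 'id3'); -- one such query per chunk, then insert only the rows that were not returned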
Is there a better way of doing it?
This should perform well: load the whole file into a temporary table with COPY, then insert only the rows that are not already present, using a single set-based anti-join:
CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
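-- Note: COPY ... FROM 'filename' reads the file from the database server's
-- filesystem. If the CSV sits on the client machine, psql's \copy meta-command
-- performs the same load from the client side, for example:
--   \copy tmp FROM 'path/to/file' WITH (FORMAT csv)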
INSERT INTO tbl
SELECT tmp.*
FROM tmp
LEFT JOIN tbl USING (tbl_id)
WHERE tbl.tbl_id IS NULL;
DROP TABLE tmp; -- otherwise it is dropped automatically at the end of the session
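As an aside: the question only says the field is indexed, but if tbl_id actually has a UNIQUE constraint or unique index, Postgres 9.5+ can do the duplicate filtering inside the INSERT itself with ON CONFLICT DO NOTHING:
INSERT INTO tbl
SELECT * FROM tmp
ON CONFLICT (tbl_id) DO NOTHING; -- skips rows whose tbl_id already exists in tbl
Unlike the anti-join, this also skips later duplicates of the same tbl_id that appear within a single CSV file.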
Closely related to this answer.