I have a process that runs every 5 minutes and tries to insert a batch of articles into a table. The articles come from web-scraping, so there are cases in which I am trying to insert a batch that contains articles which have already been saved into the DB.
My primary key is uuid, an MD5 hash of the article title.
Checking whether each article already exists in the DB to filter the batch beforehand is fairly inefficient.
Is there a DB-level way in PostgreSQL to ignore attempts to insert a duplicate uuid without returning an error?
You could insert using a WHERE NOT EXISTS clause.
For example, consider a test table with a numeric id as primary key and a textual name.
db=> CREATE TABLE test(id BIGSERIAL PRIMARY KEY, name TEXT);
CREATE TABLE
-- Insertion will work - empty table
db=> INSERT INTO test(id, name)
SELECT 1, 'Partner number 1'
WHERE NOT EXISTS (SELECT 1,2 FROM test WHERE id=1);
INSERT 0 1
-- Insertion will NOT work - duplicate id
db=> INSERT INTO test(id, name)
SELECT 1, 'Partner number 1'
WHERE NOT EXISTS (SELECT 1,2 FROM test WHERE id=1);
INSERT 0 0
-- After two insertions, the table contains only one row
db=> SELECT * FROM test;
id | name
----+------------------
1 | Partner number 1
(1 row)
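Applied to the scenario in the question, the same pattern can filter an entire scraped batch in one statement. The sketch below assumes a hypothetical articles table with uuid, title and body columns; adjust the names to your schema.
db=> CREATE TABLE articles(uuid TEXT PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE
-- Insert the batch, silently skipping rows whose uuid is already stored
db=> INSERT INTO articles(uuid, title, body)
     SELECT batch.uuid, batch.title, batch.body
     FROM (VALUES
         (md5('First title'),  'First title',  'First body'),
         (md5('Second title'), 'Second title', 'Second body')
     ) AS batch(uuid, title, body)
     WHERE NOT EXISTS (SELECT 1 FROM articles a WHERE a.uuid = batch.uuid);
INSERT 0 2
Note that this only filters against rows already in the table; two identical titles inside the same batch would still collide.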
ON CONFLICT
Quoting the documentation:
ON CONFLICT can be used to specify an alternative action to raising a unique constraint or exclusion constraint violation error.
The action can be DO NOTHING or DO UPDATE. The second approach is often referred to as upsert, a portmanteau of insert and update.
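For the same test table, the two actions look like this (standard PostgreSQL 9.5+ syntax; the session below is a sketch):
-- Skip the row silently when the id already exists
db=> INSERT INTO test(id, name)
     VALUES (1, 'Partner number 1')
     ON CONFLICT (id) DO NOTHING;
INSERT 0 0
-- Or update the existing row instead (upsert)
db=> INSERT INTO test(id, name)
     VALUES (1, 'Partner number 1 - new name')
     ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name;
INSERT 0 1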
For a single client both approaches give the same end result, but they are not strictly equivalent: ON CONFLICT DO NOTHING also tolerates a concurrent transaction inserting the same key between your existence check and your insert, while WHERE NOT EXISTS can still fail with a duplicate-key error in that race. Comparing the query plans of the two statements is a good way to dig deeper.
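If you want to compare them yourself, EXPLAIN can be run on both forms; the plan output varies by PostgreSQL version, so it is omitted here:
db=> EXPLAIN INSERT INTO test(id, name)
     SELECT 1, 'Partner number 1'
     WHERE NOT EXISTS (SELECT 1 FROM test WHERE id=1);
db=> EXPLAIN INSERT INTO test(id, name)
     VALUES (1, 'Partner number 1')
     ON CONFLICT (id) DO NOTHING;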