I have a large sourceTable (300 Million rows) with several columns:
id sequence orientation gc_content – etc
most sequences appear only once in the table, but others might be in there much more often
I need to know how often each sequence appears in the sourceTable, so I created a table targetTable with 3 columns:
id sequence frequency
The column sequence has been set to be unique.
I filled this targetTable with all distinct sequence values and set the frequency to 0 by default in each.
Now I want to select all the sequences from the sourceTable and "insert" them into the targetTable (which can't happen) and ON CONFLICT increase the frequency for that sequence by 1 (not sure that is the most intelligent way of going about it).
The code works fine for the first part
INSERT INTO targetTable (sequence, frequency)
(SELECT sequence, 1
FROM sourceTable)
ON CONFLICT (sequence)
DO NOTHING
but when I try
INSERT INTO targetTable (sequence, frequency)
(SELECT sequence, 1
FROM sourceTable)
ON CONFLICT (sequence)
--DO NOTHING
DO UPDATE SET
frequency = targetTable.frequency + 1
WHERE
sequence = targetTable.sequence -- = sourceTable.sequence
;
I get the message that sequence is ambiguous.
To make sequence not ambiguous I tried
WHERE
targetTable.sequence = sourceTable.sequence
and
WHERE
sourceTable.sequence = targetTable.sequence
but then I get "ERROR: missing FROM-clause entry for table "sourceTable" "
What am I missing here?
Thanks for any advice - I would have liked to figure that one out on my own as it seems like a common problem that should be easy to research on the internet, but I banged my head against that since yesterday morning and am running out of time …
The virtual table associated with failed INSERT
rows is named EXCLUDED
, so the SQL would be as follows, except that each row can only be affected once:
INSERT INTO targetTable (sequence, frequency)
SELECT sequence, 1
FROM sourceTable
ON CONFLICT (sequence)
DO UPDATE SET frequency = targetTable.frequency + 1
WHERE EXCLUDED.sequence = targetTable.sequence;
This is documented in PostgreSQL 15 Documention - INSERT SQL Command. Note, the SELECT
clause didn't need to be enclosed in parentheses. Avoid using sequence
as a column identifier since it is a SQL key word.
A proper approach to get the counts would be to use the COUNT
aggregate function grouped by sequence
:
INSERT INTO targetTable(sequence, frequency)
SELECT sequence, COUNT(*) AS frequency
FROM sourceTable
GROUP BY sequence;