I have got problem with sub-query plan in PostgreSQL. It seems that in my query sub-query is evaluated only once at the beginning of execution. I have got a simple table with name column where I would like to execute a simple UPDATE
.
Main task is to get rid of diacritic marks and spaces. However, multiple versions of the same word can exist. For example: 'poleć' with one space at the end and with two spaces or three spaces sometimes. I would like to change only one of them in case more version of this same words exists. When I run this query:
update newtable
set name = translate(trim(regexp_replace(name, '\s+', ' ', 'g')), 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ', 'acelnoszzACELNOSZZ' )
where translate(trim(regexp_replace(name, '\s+', ' ', 'g')), 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ', 'acelnoszzACELNOSZZ' )
not in (select sis.name from newtable as sis where sis."name" is not null)
all variants of this same word are changed, even though I try to avoid it in sub-query. Is there a way to re-evaluate a sub-query once per row.
Structure of a table is very simple.
ID | NAME |
---|---|
1 | poleć |
2 | poleć |
Is there a way to re-evaluate a sub-query once per row.
Yes, you can do that with a subquery expression in the assignment - a "correlated subquery". Like:
UPDATE t SET col = (SELECT col FROM t2);
But that's not going to solve your problem at all, as every re-evaluation still sees the same snapshot of the underlying tables from the start of the UPDATE
command. You would have to update one row at a time, which is comparatively expensive.
This does something like you ask for:
UPDATE newtable t
SET name = u.new_name
FROM (
SELECT DISTINCT ON (new_name) * -- one per group, unchanged first
FROM (
SELECT id, name
, translate(regexp_replace(trim(name), '\s\s+', ' ', 'g')
, 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ'
, 'acelnoszzACELNOSZZ') AS new_name
FROM newtable
) sub
ORDER BY new_name, name <> new_name, id
) u
WHERE u.id = t.id
AND u.name <> u.new_name; -- only if actual change
db<>fiddle here
The subquery u
picks one row from every set that would end up with the same name, using DISTINCT ON
. See:
Suspecting a UNIQUE
constraint on name
, I prioritize rows that don't change at all. That's done by name <> newname
, where false
sorts before true
. See:
Only the one row per group is updated - and only if it actually changes. This way, name
stays UNIQUE
- as long as there are no concurrent writes that might interfere.
It's unclear if that's what you actually need, I filled in gaps in the task description with educated guesses.
While being at it, I made the string expression a lot cheaper.