Search code examples
sqlpostgresqlsql-updategreatest-n-per-group

Only update one row per group of peers


I have got problem with sub-query plan in PostgreSQL. It seems that in my query sub-query is evaluated only once at the beginning of execution. I have got a simple table with name column where I would like to execute a simple UPDATE.

Main task is to get rid of diacritic marks and spaces. However, multiple versions of the same word can exist. For example: 'poleć' with one space at the end and with two spaces or three spaces sometimes. I would like to change only one of them in case more version of this same words exists. When I run this query:

update newtable
set name = translate(trim(regexp_replace(name, '\s+', ' ', 'g')), 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ', 'acelnoszzACELNOSZZ' )
where translate(trim(regexp_replace(name, '\s+', ' ', 'g')), 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ', 'acelnoszzACELNOSZZ' )
not in (select sis.name from newtable as sis where sis."name" is not null)

all variants of this same word are changed, even though I try to avoid it in sub-query. Is there a way to re-evaluate a sub-query once per row.

Structure of a table is very simple.

ID NAME
1 poleć
2 poleć

Solution

  • Is there a way to re-evaluate a sub-query once per row.

    Yes, you can do that with a subquery expression in the assignment - a "correlated subquery". Like:

    UPDATE t SET col = (SELECT col FROM t2);
    

    But that's not going to solve your problem at all, as every re-evaluation still sees the same snapshot of the underlying tables from the start of the UPDATE command. You would have to update one row at a time, which is comparatively expensive.

    This does something like you ask for:

    UPDATE newtable t
    SET    name = u.new_name
    FROM  (
       SELECT DISTINCT ON (new_name) *  -- one per group, unchanged first
       FROM  (
          SELECT id, name
               , translate(regexp_replace(trim(name), '\s\s+', ' ', 'g')
                         , 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ'
                         , 'acelnoszzACELNOSZZ') AS new_name
          FROM   newtable
          ) sub
       ORDER  BY new_name, name <> new_name, id
       ) u
    WHERE u.id = t.id
    AND   u.name <> u.new_name;  -- only if actual change
    

    db<>fiddle here

    The subquery u picks one row from every set that would end up with the same name, using DISTINCT ON. See:

    Suspecting a UNIQUE constraint on name, I prioritize rows that don't change at all. That's done by name <> newname, where false sorts before true. See:

    Only the one row per group is updated - and only if it actually changes. This way, name stays UNIQUE - as long as there are no concurrent writes that might interfere.

    It's unclear if that's what you actually need, I filled in gaps in the task description with educated guesses.

    While being at it, I made the string expression a lot cheaper.