sql regex postgresql sql-update postgresql-10

How to use regexp_matches() in an UPDATE statement?

I am trying to clean up a table that has a very messy varchar column, with entries like this:

<u><font color="#0000FF"><a href="http://virginialidar.com/index-3.html#.VgLbFPm6e73" target="_blank">VA Lidar</a></font></u> OR <u><font color="#0000FF"><a href="https://inport.nmfs.noaa.gov/inport/item/50122" target="_blank">InPort Metadata</a></font></u>

I would like to update the column by keeping only the HTML links, separated by a comma if there is more than one. Ideally I would do something like this:

UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');

But unfortunately this returns an error in Postgres 10:

ERROR: set-returning functions are not allowed in UPDATE

I assume regexp_matches() is the set-returning function in question. Any ideas on how I can achieve this?

Solution

Notes

1.
You don't need to base the correlated subquery on a separate instance of the base table (as other answers suggested). That would be more work for nothing.

2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:

Why is array_agg() slower than the non-aggregate ARRAY() constructor?
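
A minimal sketch of the two forms, for illustration only (generate_series() is just a stand-in row source here, not part of the original problem):

```sql
-- Both forms collect the rows of a subquery into one array.
SELECT ARRAY(SELECT g FROM generate_series(1, 3) g)       AS via_constructor
     , (SELECT array_agg(g) FROM generate_series(1, 3) g) AS via_array_agg;

-- They differ on empty input:
-- the ARRAY constructor yields an empty array, array_agg() yields NULL.
SELECT ARRAY(SELECT g FROM generate_series(1, 0) g)       AS via_constructor  -- {}
     , (SELECT array_agg(g) FROM generate_series(1, 0) g) AS via_array_agg;   -- NULL
```

The behavior on empty input is also why Query 1 below produces '' rather than NULL where nothing matches.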

3.
I use a regular expression with capturing parentheses instead of lookahead and lookbehind constraints: href="([^"]+)

See query 1.

This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():

If a match is found, and the pattern contains no parenthesized subexpressions, then the result is a single-element text array containing the substring matching the whole pattern. If a match is found, and the pattern contains parenthesized subexpressions, then the result is a text array whose n'th element is the substring matching the n'th parenthesized subexpression of the pattern.

And for regexp_matches():

This function returns no rows if there is no match, one row if there is a match and the g flag is not given, or N rows if there are N matches and the g flag is given. Each returned row is a text array containing the whole matched substring or the substrings matching parenthesized subexpressions of the pattern, just as described above for regexp_match.
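
To make the difference concrete, here is a small sketch using a made-up input string (the a.example URLs are placeholders, not data from the question):

```sql
-- regexp_match() (Postgres 10+): first match only, returned as a single text[].
SELECT regexp_match('<a href="http://a.example/1">A</a> <a href="http://a.example/2">B</a>'
                  , 'href="([^"]+)');
-- {http://a.example/1}

-- regexp_matches() with the g flag: one row per match, each row a text[].
SELECT regexp_matches('<a href="http://a.example/1">A</a> <a href="http://a.example/2">B</a>'
                    , 'href="([^"]+)', 'g');
-- {http://a.example/1}
-- {http://a.example/2}
```

In both cases only the part inside the parentheses is returned, so the leading href=" is consumed by the match but not included in the result.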

4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). That does not occur with this regexp: every array in the result holds a single element. But future readers should not be led into a trap:

When feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor), the result is a 2-D array, which is only possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:

Is there something like a zip() function in PostgreSQL that combines two arrays?

However, quoting the manual:

inputs must all have same dimensionality, and cannot be empty or NULL

I think this can never fail, as the same regexp always produces the same number of array elements. Ours always produces one element. But other regular expressions may produce more than one element per match. If so, there are various options:

Only take the first element with (regexp_matches(...))[1]. See query 2.
Unnest arrays and use string_agg() on base elements. See query 3.

Each approach works here, too.
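
As a sketch of the trap described above, consider a hypothetical pattern with two capture groups (URL and link text) applied to a made-up string. Neither the pattern nor the sample data is from the question, and the array forms assume Postgres 9.5 or later:

```sql
WITH s(html) AS (
   SELECT '<a href="u1">t1</a> <a href="u2">t2</a>'::text
   )
SELECT -- 2-D array, one sub-array per match: {{u1,t1},{u2,t2}}
       ARRAY(SELECT regexp_matches(html, 'href="([^"]+)">([^<]+)', 'g')) AS two_d
       -- array_to_string() flattens all elements: u1,t1,u2,t2
     , array_to_string(
          ARRAY(SELECT regexp_matches(html, 'href="([^"]+)">([^<]+)', 'g')), ',') AS flattened
       -- first element of each match only: u1,u2
     , (SELECT string_agg(m.arr[1], ',')
        FROM   regexp_matches(html, 'href="([^"]+)">([^<]+)', 'g') m(arr)) AS first_only
FROM   s;
```

With more than one capture group, flattening the 2-D array mixes URLs and link texts, which is why picking the first element is the safer choice when only one group is wanted.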

Query 1

UPDATE tbl t
SET    col = (
   SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
   );

In rows with no match, col is set to '' (empty string).
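
A minimal test case for Query 1, assuming a throwaway table named tbl_demo (a hypothetical name) filled with the sample value from the question plus one row without any link:

```sql
CREATE TEMP TABLE tbl_demo (id int PRIMARY KEY, col varchar);

INSERT INTO tbl_demo VALUES
  (1, '<u><font color="#0000FF"><a href="http://virginialidar.com/index-3.html#.VgLbFPm6e73" target="_blank">VA Lidar</a></font></u> OR <u><font color="#0000FF"><a href="https://inport.nmfs.noaa.gov/inport/item/50122" target="_blank">InPort Metadata</a></font></u>')
, (2, 'no link here');

-- Query 1, applied to the demo table
UPDATE tbl_demo
SET    col = (
   SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
   );

SELECT id, col FROM tbl_demo ORDER BY id;
-- 1 | http://virginialidar.com/index-3.html#.VgLbFPm6e73,https://inport.nmfs.noaa.gov/inport/item/50122
-- 2 | (empty string)
```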

Query 2

UPDATE tbl
SET    col = (
   SELECT string_agg(t.arr[1], ',')
   FROM   regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
   );

In rows with no match, col is set to NULL.

Query 3

UPDATE tbl
SET    col = (
   SELECT string_agg(elem, ',')
   FROM   regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
        , unnest(t.arr) elem
   );

In rows with no match, col is set to NULL.
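
If overwriting rows that contain no link is not desired, one possible variation is to restrict the UPDATE to rows with at least one match. This is an assumption about the goal, not part of the original request. A sketch based on Query 2:

```sql
UPDATE tbl
SET    col = (
   SELECT string_agg(t.arr[1], ',')
   FROM   regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
   )
WHERE  col ~ 'href="[^"]+';  -- skip rows without any href="..." attribute
```

Rows filtered out by the WHERE clause keep their original value instead of becoming NULL.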

db<>fiddle here (with extended test case)