Running PostgreSQL 9.2.8 here...
I've been trying to obtain a list of all rows in a table that have non-ASCII characters in the address - anything outside the range of <space>
to ~
, as well as the backtick character `
. If any of the rows contain any invalid characters, the row with all address values would display. But for some reason I'm getting the following error:
ERROR: argument of OR must not return a set (10586)
LINE 9: (CAST(regexp_matches(a.address_line_1,'([^ !-~]|`)') AS VARCHAR)...
^
********** Error **********
ERROR: argument of OR must not return a set (10586)
SQL state: 42804
Character: 252
The query I've been trying to use is as follows:
select a.address_id, a.address_line_1,
a.address_line_2,
a.address_line_3,
regexp_matches(a.address_line_1,'([^ !-~]|`)'),
regexp_matches(a.address_line_2,'([^ !-~]|`)'),
regexp_matches(a.address_line_3,'([^ !-~]|`)')
FROM public.address a
WHERE
(CAST(regexp_matches(a.address_line_1,'([^ !-~]|`)') AS VARCHAR) <> '') OR
(CAST(regexp_matches(a.address_line_2,'([^ !-~]|`)') AS VARCHAR) <> '') OR
(CAST(regexp_matches(a.address_line_3,'([^ !-~]|`)') AS VARCHAR) <> '')
LIMIT 1000
I'm not sure what I could be missing as this seems like a valid query.
I am trying to get rows where there is an invalid character in ANY of the three address fields, not just an invalid character in all three.
regexp_matches() returns SETOF text
and cannot be used like you tried (as the error message tells you). You could use the regular expression operator ~
instead.
But your regular expression does not seem to cover what you describe as:
non-ASCII characters in the address
Also, the range !-~
in your bracket expression [^ !-~]
depends on your COLLATION
setting. The manual warns:
Ranges are very collating-sequence-dependent, so portable programs should avoid relying on them.
Consider:
SELECT g, chr(g), chr(g) ~ '([^ !-~]|`)'
FROM generate_series (1,300) g; -- ASCII range plus some
Assuming server encoding UTF8, to find rows with any non-ASCII characters in the 3 columns:
...
WHERE octet_length(concat(a.address_line_1, a.address_line_2, a.address_line_3))
<> length(concat(a.address_line_1, a.address_line_2, a.address_line_3))
This works because all non-ASCII characters are encoded with more than 1 byte in UTF8, hence octet_length()
reports a higher number than length()
(alias: char_length()
). Concatenation with concat()
defends against possible NULL values.
To also test for the backtick, add:
...
OR concat(a.address_line_1, a.address_line_2, a.address_line_3) LIKE '%`%'