sql regex postgresql non-ascii-characters

ERROR: argument of OR must not return a set

Running PostgreSQL 9.2.8 here...

I've been trying to obtain a list of all rows in a table that have non-ASCII characters in the address - anything outside the range of <space> to ~, as well as the backtick character ` . If any of the rows contain any invalid characters, the row with all address values would display. But for some reason I'm getting the following error:

ERROR:  argument of OR must not return a set (10586)
LINE 9: (CAST(regexp_matches(a.address_line_1,'([^ !-~]|`)') AS VARCHAR)...
         ^

********** Error **********

ERROR: argument of OR must not return a set (10586)
SQL state: 42804
Character: 252

The query I've been trying to use is as follows:

select a.address_id, a.address_line_1, 
    a.address_line_2, 
    a.address_line_3, 
regexp_matches(a.address_line_1,'([^ !-~]|`)'),
regexp_matches(a.address_line_2,'([^ !-~]|`)'),
regexp_matches(a.address_line_3,'([^ !-~]|`)')
    FROM public.address a 
WHERE 
(CAST(regexp_matches(a.address_line_1,'([^ !-~]|`)') AS VARCHAR) <> '') OR
(CAST(regexp_matches(a.address_line_2,'([^ !-~]|`)') AS VARCHAR) <> '') OR
(CAST(regexp_matches(a.address_line_3,'([^ !-~]|`)') AS VARCHAR) <> '')
LIMIT 1000

I'm not sure what I could be missing as this seems like a valid query.

I am trying to get rows where there is an invalid character in ANY of the three address fields, not just an invalid character in all three.

Solution

regexp_matches() returns SETOF text and cannot be used like you tried (as the error message tells you). You could use the regular expression operator ~ instead.

But your regular expression does not seem to cover what you describe as:

non-ASCII characters in the address

Also, the range !-~ in your bracket expression [^ !-~] depends on your COLLATION setting. The manual warns:

Ranges are very collating-sequence-dependent, so portable programs should avoid relying on them.

Consider:

SELECT g, chr(g), chr(g) ~ '([^ !-~]|`)'
FROM   generate_series (1,300) g;  -- ASCII range plus some

Assuming server encoding UTF8, to find rows with any non-ASCII characters in the 3 columns:

...
WHERE octet_length(concat(a.address_line_1, a.address_line_2, a.address_line_3))
         <> length(concat(a.address_line_1, a.address_line_2, a.address_line_3))

This works because all non-ASCII characters are encoded with more than 1 byte in UTF8, hence octet_length() reports a higher number than length() (alias: char_length()). Concatenation with concat() defends against possible NULL values.

To also test for the backtick, add:

...
OR  concat(a.address_line_1, a.address_line_2, a.address_line_3) LIKE '%`%'