Search code examples
sqlregexpostgresqlfind

Finding values where the content consists of allowed characters with optional x wildcards


I'm looking to filter rows with strings by a set of allowed characters so that the remaining rows only consist of the allowed characters. But it should also allow for a x number of wildcards.

Any character in the allowed list is allowed to be used unlimited times so a matches aaaaaa. The wildcards could be either 1 use per wildcard or unlimited per wildcard, preferably a solution for both situations.

So an example where the allowed characters are [a,b,c] with 0 to 2 wildcards:

DataSet No Wildcards 1 Wildcard 2 Wildcards
id value id value id value id value
1 aabc 1 aabc 1 aabc 1 aabc
2 aabcd 4 cba 2 aabcd 2 aabcd
3 axbd 4 cba 3 axbd
4 cba 4 cba

The closest I've gotten is using regex for the 1 wildcard question, but that isn't scalable.

SELECT * FROM table WHERE value ~* '^[abc -]*$|^[a-z][abc -]*$|^[abc -]*[a-z]$|^[abc -]*[a-z][abc -]*$'

Solution

  • Use regexp_split_to_table to split "value" into rows (characters).

    Then exclude a,b,c and group rows by "id".

    Just change the count value "0" for Wildcard.

    select * from "table" where not exists (
        select id from (
            select id, count(*) as count from (
                select id, regexp_split_to_table(value,'') as r
                from "table"
            ) a
            where r ~ '^[^abc]$'
            group by id
        ) b where count > 0 and "table".id = b.id
    )-- change the number ^ for Wildcard
    

    Result: No Wildcards

    id value
    1 aabc
    4 cba

    1 Wildcard

    id value
    1 aabc
    2 aabcd
    4 cba

    .


    Details

    From three to eight lines SQL above returns the table below.

    id count
    3 2
    2 1

    This table's count column shows the number of non-abc characters in "value" text.

    Wildcard means non-abc characters so we just exclude the ids using "not exists" operator.

    We can control the number of Wildcards with the number after "count >" query text.