I'm looking to filter rows with strings by a set of allowed characters so that the remaining rows only consist of the allowed characters. But it should also allow for a x number of wildcards.
Any character in the allowed list is allowed to be used unlimited times so a matches aaaaaa. The wildcards could be either 1 use per wildcard or unlimited per wildcard, preferably a solution for both situations.
So an example where the allowed characters are [a,b,c] with 0 to 2 wildcards:
DataSet | No Wildcards | 1 Wildcard | 2 Wildcards | |||||||
---|---|---|---|---|---|---|---|---|---|---|
id | value | id | value | id | value | id | value | |||
1 | aabc | 1 | aabc | 1 | aabc | 1 | aabc | |||
2 | aabcd | 4 | cba | 2 | aabcd | 2 | aabcd | |||
3 | axbd | 4 | cba | 3 | axbd | |||||
4 | cba | 4 | cba |
The closest I've gotten is using regex for the 1 wildcard question, but that isn't scalable.
SELECT * FROM table WHERE value ~* '^[abc -]*$|^[a-z][abc -]*$|^[abc -]*[a-z]$|^[abc -]*[a-z][abc -]*$'
Use regexp_split_to_table to split "value" into rows (characters).
Then exclude a,b,c and group rows by "id".
Just change the count value "0" for Wildcard.
select * from "table" where not exists (
select id from (
select id, count(*) as count from (
select id, regexp_split_to_table(value,'') as r
from "table"
) a
where r ~ '^[^abc]$'
group by id
) b where count > 0 and "table".id = b.id
)-- change the number ^ for Wildcard
Result: No Wildcards
id | value |
---|---|
1 | aabc |
4 | cba |
1 Wildcard
id | value |
---|---|
1 | aabc |
2 | aabcd |
4 | cba |
.
Details
From three to eight lines SQL above returns the table below.
id | count |
---|---|
3 | 2 |
2 | 1 |
This table's count column shows the number of non-abc characters in "value" text.
Wildcard means non-abc characters so we just exclude the ids using "not exists" operator.
We can control the number of Wildcards with the number after "count >" query text.