I am trying to match rude words in user input. For example, "I Hate You!" or "i.håté.Yoù" should match "hate you" in an array of words parsed from JSON.
So I need it to be case and diacritic insensitive, and to treat whitespace in the rude words as any non-letter character: the regex metacharacter \P{L} should work for that, or at least \W. Now, I know [cd] works with NSPredicate, like this:
func matches(text: String) -> [String]? {
    if let rudeWords = JSON?["words"] as? [String] {
        return rudeWords.filter {
            let pattern = $0.stringByReplacingOccurrencesOfString(" ", withString: "\\P{L}", options: .CaseInsensitiveSearch)
            return NSPredicate(format: "SELF MATCHES[cd] %@", pattern).evaluateWithObject(text)
        }
    } else {
        log.debug("error fetching rude words")
        return nil
    }
}
That doesn't work with either metacharacter. I guess they are not parsed by NSPredicate, so I tried using NSRegularExpression like this:
func matches(text: String) -> [String]? {
    if let rudeWords = JSON?["words"] as? [String] {
        return rudeWords.filter {
            do {
                let pattern = $0.stringByReplacingOccurrencesOfString(" ", withString: "\\P{L}", options: .CaseInsensitiveSearch)
                let regex = try NSRegularExpression(pattern: pattern, options: .CaseInsensitive)
                // NSRange is measured in UTF-16 units, not in Characters
                return regex.matchesInString(text, options: [], range: NSMakeRange(0, text.utf16.count)).count > 0
            }
            catch _ {
                log.debug("error parsing rude word regex")
                return false
            }
        }
    } else {
        log.debug("error fetching rude words")
        return nil
    }
}
This seems to work OK. However, there is no way that I know of to make a regex diacritic insensitive, so I tried this (and other solutions like re-encoding):
let text = text.stringByFoldingWithOptions(.DiacriticInsensitiveSearch, locale: NSLocale.currentLocale())
However, this does not work for me: I check the user input every time a character is typed, and all the solutions I tried for stripping accents made the app extremely slow.
Does someone know if there are any other solutions, or if I am using this the wrong way?
Thanks
I was actually mistaken: what was making the app slow was trying to match with \P{L}. I tried the second solution with \W and with the accent-stripping line, and now it works OK, even if it matches fewer strings than I initially wanted.
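For anyone landing here, the combination described in this update (fold the diacritics away, then match with \W in place of spaces) could look roughly like the sketch below. It is written in current Swift syntax rather than the Swift 2 syntax used above, and the helper names makeRegexes and matches(in:using:) are made up for illustration. Compiling the regexes once up front, instead of on every keystroke, is an assumption about what helps with speed, not something that was measured here.

```swift
import Foundation

// Hypothetical helper: builds one case-insensitive regex per rude word.
// Spaces inside a word become \W, as in the update above.
func makeRegexes(for rudeWords: [String]) -> [(word: String, regex: NSRegularExpression)] {
    var result: [(word: String, regex: NSRegularExpression)] = []
    for word in rudeWords {
        // Escape each space-separated part so the word itself is matched literally.
        let pattern = word
            .components(separatedBy: " ")
            .map { NSRegularExpression.escapedPattern(for: $0) }
            .joined(separator: "\\W")
        if let regex = try? NSRegularExpression(pattern: pattern, options: [.caseInsensitive]) {
            result.append((word: word, regex: regex))
        }
    }
    return result
}

// Hypothetical helper: returns the rude words found in the given text.
func matches(in text: String, using regexes: [(word: String, regex: NSRegularExpression)]) -> [String] {
    // Strip accents once per input instead of once per word.
    let folded = text.folding(options: .diacriticInsensitive, locale: nil)
    // NSRange counts UTF-16 units, so derive it from the string itself.
    let range = NSRange(folded.startIndex..., in: folded)
    return regexes
        .filter { $0.regex.firstMatch(in: folded, options: [], range: range) != nil }
        .map { $0.word }
}

let regexes = makeRegexes(for: ["hate you"])
print(matches(in: "I Hate You!", using: regexes))
print(matches(in: "i.håté.Yoù", using: regexes))
```

Both example inputs from the question should come back as matching "hate you". Note that \W also treats digits and underscores as word characters, so it is stricter than \P{L}.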
These might help some people dealing with regex and predicates:
It might be worthwhile to go in a different direction. Instead of flattening the input, what if you changed the regex?
Instead of matching against hate.you, you could match against [h][åæaàâä][t][ëèêeé].[y][o0][ùu], for example (it's not a comprehensive list, in any case). It would make most sense to do this transformation on the fly (not storing it), because that makes it easier to change what the characters expand to later.
This will give you some more control over what characters will match. If you look, I have 0 as a character matching o. No amount of Unicode coercion could let you do that.
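A minimal sketch of that on-the-fly expansion is below, in current Swift syntax. The variants table and the function names are hypothetical and deliberately incomplete. It assumes the input uses precomposed characters (for example a single å rather than a followed by a combining ring), and it uses \W for spaces, as in the question, where the pattern above uses a dot.

```swift
import Foundation

// Hypothetical, deliberately incomplete table of look-alike characters.
let variants: [Character: String] = [
    "a": "aàâäåæ",
    "e": "eèéêë",
    "o": "o0òóôö",
    "u": "uùúûü",
]

// Expands a plain rude word into a regex pattern on the fly.
func expandedPattern(for word: String) -> String {
    var pattern = ""
    for character in word.lowercased() {
        if character == " " {
            // A space in the word stands for any single non-word character.
            pattern += "\\W"
        } else if let alternatives = variants[character] {
            // Letters with known look-alikes become a character class.
            pattern += "[" + alternatives + "]"
        } else {
            // Everything else is matched literally.
            pattern += NSRegularExpression.escapedPattern(for: String(character))
        }
    }
    return pattern
}

// Hypothetical helper: true if the expanded word occurs anywhere in the text.
func containsRudeWord(_ word: String, in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: expandedPattern(for: word),
                                               options: [.caseInsensitive]) else {
        return false
    }
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

print(containsRudeWord("hate you", in: "i.håté.Yoù"))
print(containsRudeWord("hate you", in: "hate y0u"))
```

Because the table is only consulted when the pattern is built, changing what a letter expands to later means editing one dictionary entry and nothing stored.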