My app has a feature that filters content based on keywords.
The matching is case-insensitive, so I first call String.toLowerCase()
on the source content.
The issue appears when the source is in upper case and contains accented characters, as in the French word "INVITÉ".
When this word is converted to lowercase
using the device's default locale, the result is "invité".
The problem is that the last character is not the same as the single lowercase character "é".
Instead, it is a combination of 2 chars:
"e" (code 101) and
a combining acute accent (code 769, i.e. U+0301)
Because of this, the lowercased source "invité" does not match my keyword "invité", even though the two look identical.
How can I solve this? I would prefer not to remove accented characters altogether.
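You should normalize the string to the composed form (NFC), like this. Note that toLowerCase() is not what splits the character: the source is already decomposed before lowercasing, as the lengths in the output show.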
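// Requires: import java.text.Normalizer;
String upper = "INVITE\u0301"; // "INVITÉ" in decomposed form: E + combining acute accent (U+0301)
System.out.println(upper + " length=" + upper.length());
String lower = upper.toLowerCase(); // still decomposed: e + U+0301
System.out.println(lower + " length=" + lower.length());
// NFC composes e + U+0301 into the single character é (U+00E9)
String normalized = Normalizer.normalize(lower, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());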
output:
INVITÉ length=7
invité length=7
invité length=6
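For the keyword filter itself, the same idea can be applied to both sides of the comparison. Here is a minimal sketch, assuming a plain substring match is what the filter does; the class KeywordFilter and the methods canonical and containsKeyword are made-up names for illustration, and using Locale.ROOT instead of the device default locale is my assumption, not something from the question.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class KeywordFilter {

    // Lowercase first, then compose to NFC, in the same order as the snippet above.
    // Locale.ROOT is an assumption: it keeps the result independent of the device locale.
    static String canonical(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(lower, Normalizer.Form.NFC);
    }

    // Both the content and the keyword go through the same canonical form,
    // so composed and decomposed spellings compare equal.
    static boolean containsKeyword(String content, String keyword) {
        return canonical(content).contains(canonical(keyword));
    }

    public static void main(String[] args) {
        String source = "LISTE DES INVITE\u0301S"; // decomposed É: E + U+0301
        String keyword = "invit\u00e9";            // precomposed é
        System.out.println(source.toLowerCase(Locale.ROOT).contains(keyword)); // false
        System.out.println(containsKeyword(source, keyword));                  // true
    }
}
```

Normalizing the keyword as well as the content means it does not matter which form either one arrives in.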
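NFC normalization also works for Japanese, where a kana followed by a combining voiced sound mark (dakuten) is composed into a single character.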
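// Requires: import java.text.Normalizer;
String japanese = "\u304B\u3099"; // "が" in decomposed form: か + combining dakuten (U+3099)
System.out.println(japanese + " length=" + japanese.length());
String normalized = Normalizer.normalize(japanese, Normalizer.Form.NFC); // composed: U+304C
System.out.println(normalized + " length=" + normalized.length());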
output:
が length=2
が length=1
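Since composed and decomposed strings look identical when printed, it can help to dump the code points when debugging. This is only a sketch: CodePointDump and its codePoints method are hypothetical names, while Normalizer.normalize and Normalizer.isNormalized are the standard java.text.Normalizer methods.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical debugging helper, not part of any library.
final class CodePointDump {

    // Renders each code point as U+XXXX so invisible combining marks become visible.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // e + combining acute accent
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(codePoints(decomposed)); // U+0065 U+0301
        System.out.println(codePoints(composed));   // U+00E9

        // isNormalized reports whether a string is already in the given form.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC));   // true
    }
}
```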