java android non-ascii-characters lowercase

Android toLowerCase() issue with accented characters

My app has a feature to filter content based on some keywords. The filter is case-insensitive, so I first call String.toLowerCase() on the source content.

The issue I have is when the source is in upper case and contains accented characters, as in the French word "INVITÉ".

When this word is set to lowercase using the device default locale, it returns "invité". The problem is that the last character is not the single lowercase character "é" (U+00E9). Instead, it is a combination of 2 chars: "e" (code point 101) followed by the combining acute accent (code point 769, U+0301).

Because of this, the decomposed "invité" (7 chars) does not match the precomposed "invité" (6 chars).

How can I solve this? I would prefer not to remove accented characters altogether.

Solution

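You should normalize the string to NFC (canonical composition) with java.text.Normalizer after lowercasing it, so that "e" followed by the combining accent is composed into the single character "é". Like this: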

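import java.text.Normalizer;

// "E" followed by the combining acute accent U+0301 (decomposed form)
String upper = "INVITE\u0301";
System.out.println(upper + " length=" + upper.length());
// Lowercasing keeps the combining accent as a separate char
String lower = upper.toLowerCase();
System.out.println(lower + " length=" + lower.length());
// NFC composes "e" + U+0301 into the single char U+00E9
String normalized = Normalizer.normalize(lower, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());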

output:

INVITÉ length=7
invité length=7
invité length=6

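It also works for Japanese, where "か" followed by the combining voiced sound mark (U+3099) composes into the single character "が".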

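// "か" (U+304B) followed by the combining voiced sound mark U+3099
String japanese = "\u304B\u3099";
System.out.println(japanese + " length=" + japanese.length());
// NFC composes the pair into the single char U+304C
String normalized = Normalizer.normalize(japanese, Normalizer.Form.NFC);
System.out.println(normalized + " length=" + normalized.length());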

output:

が length=2
が length=1
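
For the keyword filter in the question, one way to apply this is to run both the content and the keyword through the same lowercase-then-NFC step before comparing. The following is only a minimal sketch: the class name KeywordFilter and the methods canonical and matches are hypothetical names chosen for illustration, and using Locale.ROOT instead of the device default locale is an assumption made so the result does not depend on the device's language settings.

```java
import java.text.Normalizer;
import java.util.Locale;

public class KeywordFilter {

    // Hypothetical helper: lowercase first, then compose to NFC so that
    // "e" + U+0301 and the precomposed U+00E9 end up identical.
    static String canonical(String s) {
        String lower = s.toLowerCase(Locale.ROOT); // assumption: locale-independent lowercasing
        return Normalizer.normalize(lower, Normalizer.Form.NFC);
    }

    // Hypothetical helper: case-insensitive, normalization-insensitive containment check.
    static boolean matches(String content, String keyword) {
        return canonical(content).contains(canonical(keyword));
    }

    public static void main(String[] args) {
        String decomposed = "INVITE\u0301"; // "E" + combining acute accent
        String precomposed = "invit\u00E9"; // single precomposed char

        // Plain lowercasing leaves the two forms different.
        System.out.println(decomposed.toLowerCase(Locale.ROOT).equals(precomposed)); // false

        // After normalizing both sides, the keyword is found.
        System.out.println(matches(decomposed, precomposed)); // true
    }
}
```

Normalizing the keyword as well as the content matters, because a keyword typed on one keyboard may arrive precomposed while the content arrives decomposed, or the other way around.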