I'm trying to remove all diacritical marks from a string during a validation (for more background, see below). In order to do that, I'm using the following code:
    private static String stripAccents(final String s) {
        if (s == null) {
            return "";
        }
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    }
My problem is that this doesn't work for the character "ø", which stays as is. After looking up the character class "InCombiningDiacriticalMarks", I found this question: What built-in regex character classes are supported Java
This led me to the official Unicode chart of everything considered a combining diacritical mark, here: https://www.unicode.org/charts/PDF/U0300.pdf, and the code point 0338 seems to match "ø" pretty well.
Am I missing something, or is the character class "InCombiningDiacriticalMarks" not fully supported in Java?
As to WHY I need this, some background:
I'm sending data containing Scandinavian characters to an external party, and when they send the data back, they have the funny habit of removing or even replacing diacritical marks (e.g. ø becomes ö). I tried to make them do it right, but they just won't, and I have no way of forcing them to.
So in order to compare the data to verify what was sent is what we get back, I have to remove all diacritical marks to avoid a ton of false positives.
So just like Jesper mentioned, the problem is that the character "ø" is NOT an "o" with a diacritical mark, but is considered a full-fledged character that can itself take diacritical marks, like the "ø̈" in "Grø̈nland" (see https://en.wikipedia.org/wiki/%C3%98).
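A quick way to see this is to print the code points that NFD normalization produces. The sketch below is only an illustration; the class name `DecompositionDemo` and the helper `nfdCodePoints` are made up for this example. Letters such as "é" and "ö" split into a base letter plus a combining mark, while "ø" comes back as the same single code point, so there is nothing for the regex to remove.

```java
import java.text.Normalizer;

final class DecompositionDemo {

    // Hypothetical helper: returns the code points of s after NFD
    // normalization, formatted like "U+0065 U+0301".
    static String nfdCodePoints(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(nfdCodePoints("\u00E9")); // é -> U+0065 U+0301
        System.out.println(nfdCodePoints("\u00F6")); // ö -> U+006F U+0308
        System.out.println(nfdCodePoints("\u00F8")); // ø -> U+00F8 (unchanged)
    }
}
```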
As a result, the only way to programmatically convert "ø" to "o" (which is what I needed) is to explicitly replace "ø" with "o". The method mentioned above therefore becomes:
    private String stripAccents(final String s) {
        if (s == null) {
            return "";
        }
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\\p{InCombiningDiacriticalMarks}]", "").replaceAll("ø", "o");
    }
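If more letters of this kind turn up (the uppercase "Ø" is the obvious one), chaining further replace calls gets unwieldy. One possible variation, sketched here under assumptions, keeps the explicit replacements in a small lookup table. The class name `AccentStripper`, the table `NO_DECOMPOSITION` and the helper `sameIgnoringAccents` are hypothetical names, the table is deliberately not exhaustive, and `Map.of` assumes Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Map;

final class AccentStripper {

    // Hypothetical, non-exhaustive table of letters that have no canonical
    // decomposition and therefore survive NFD + mark removal untouched.
    private static final Map<Character, String> NO_DECOMPOSITION = Map.of(
            '\u00F8', "o",  // ø
            '\u00D8', "O"   // Ø
    );

    static String stripAccents(final String s) {
        if (s == null) {
            return "";
        }
        // Step 1: decompose, then drop the combining marks (U+0300..U+036F).
        String stripped = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: replace the letters listed in the table explicitly.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String replacement = NO_DECOMPOSITION.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Compares what was sent with what came back, ignoring diacritics.
    static boolean sameIgnoringAccents(String sent, String received) {
        return stripAccents(sent).equals(stripAccents(received));
    }
}
```

With this, `sameIgnoringAccents("Grønland", "Grönland")` evaluates to true, since both sides reduce to "Gronland".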