I have a data set that contains accented characters. I want to convert the accented letters to their plain ASCII equivalents. I do that with the following code:
import java.text.Normalizer;
import java.util.regex.Pattern;
public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
What this code is missing is a way to exclude certain characters from the conversion. For example, I want to keep the letter "ü" in the word Düsseldorf, so that it does not turn into Dusseldorf. Is there a way to pass an exclude list to the method or the matcher so that certain accented characters are not converted?
Do not use normalization to remove accents!
For example, the following letters are not converted to ASCII by your method, because they have no Unicode decomposition into a base letter plus a combining mark:
ł
đ
ħ
You may also want to split ligatures like œ into separate letters (i.e. oe).
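To see the limitation for yourself, here is a small sketch that reuses the approach from the question. The class name NfdLimitsDemo is made up for illustration, and the strings use Unicode escapes only so that the file compiles regardless of source encoding:

```java
import java.text.Normalizer;

final class NfdLimitsDemo {
    // Same idea as in the question: decompose with NFD, then drop combining marks.
    static String deAccent(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return nfd.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "Düsseldorf": ü decomposes into u + combining diaeresis, so it becomes "Dusseldorf"
        System.out.println(deAccent("D\u00FCsseldorf"));
        // "Łódź": ó and ź are stripped, but Ł has no decomposition and survives ("Łodz")
        System.out.println(deAccent("\u0141\u00F3d\u017A"));
        // "cœur": the ligature œ has no decomposition either, so the string is unchanged
        System.out.println(deAccent("c\u0153ur"));
    }
}
```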
Try this:
// Replacement table for U+00C0..U+017F, one entry per code point
// (the letter part of Latin-1 Supplement, then Latin Extended-A).
private static final String TAB_00C0 = "" +
    "AAAAAAACEEEEIIII" +
    "DNOOOOO×OUUUÜYTs" + // Ü is kept as-is (the letter you wanted); × is preserved too
    "aaaaaaaceeeeiiii" +
    "dnooooo÷ouuuüyty" + // ü is kept as-is; the division sign ÷ is preserved too
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzs";
public static String toPlain(String source) {
    StringBuilder sb = new StringBuilder(source.length());
    for (int i = 0; i < source.length(); i++) {
        char c = source.charAt(i);
        switch (c) {
            case 'ß':
                sb.append("ss");
                break;
            case 'Œ':
                sb.append("OE");
                break;
            case 'œ':
                sb.append("oe");
                break;
            // insert more ligatures you want to support here,
            // or other letters you want to convert in a non-standard way
            // I recommend taking a look at: æ þ ð ﬂ ﬁ
            default:
                // characters in U+00C0..U+017F are looked up in the table;
                // everything else is copied unchanged
                if (c >= 0xc0 && c <= 0x17f) {
                    c = TAB_00C0.charAt(c - 0xc0);
                }
                sb.append(c);
        }
    }
    return sb.toString();
}
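If you would rather pass the exceptions in at call time than hard-code them in the table, one option is to check an exclude set before the lookup. The following is only a sketch under a few assumptions: the names PlainWithExclusions, PLAIN and toPlainExcept are made up, the mapping holds just a handful of entries for illustration (a real one would cover the whole range, like the table above), and Map.of / Set.of require Java 9 or later:

```java
import java.util.Map;
import java.util.Set;

final class PlainWithExclusions {
    // Tiny illustrative mapping, NOT a complete table.
    static final Map<Character, String> PLAIN = Map.of(
        '\u00FC', "u",   // ü
        '\u00F6', "o",   // ö
        '\u00E9', "e",   // é
        '\u0142', "l",   // ł
        '\u00DF', "ss",  // ß
        '\u0153', "oe"   // œ
    );

    // Converts every character found in PLAIN, except those listed in keep.
    static String toPlainExcept(String source, Set<Character> keep) {
        StringBuilder sb = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            String plain = keep.contains(c) ? null : PLAIN.get(c);
            if (plain != null) {
                sb.append(plain);   // mapped and not excluded
            } else {
                sb.append(c);       // excluded, or no mapping known
            }
        }
        return sb.toString();
    }
}
```

With this, toPlainExcept("Düsseldorf café", Set.of('ü')) leaves the ü alone and returns "Düsseldorf cafe", while passing an empty set converts both letters.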