Tags: java, character-encoding, non-ascii-characters, extended-ascii

Using Java Normalizer to convert accented characters to unaccented ones while excluding some symbols


I have a data set that contains accented characters. I want to convert the accented letters to their plain English equivalents. I do that with the following code:

import java.text.Normalizer;
import java.util.regex.Pattern;

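// Compile once and reuse: Pattern instances are immutable and thread-safe.
private static final Pattern DIACRITICS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

public String deAccent(String str) {
    // NFD splits each accented letter into a base letter plus combining marks.
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Removing the marks leaves only the base letters.
    return DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
}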

What this code lacks is a way to exclude characters from the conversion. For example, I want to keep the letter "ü" in "Düsseldorf" so that it does not become "Dusseldorf". Is there a way to pass an exclude list to the method or the matcher so that certain accented characters are left unconverted?
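One possible reading of the request, as a minimal sketch rather than a definitive answer: walk the string one code point at a time and skip normalization for anything in an exclude set. The class name `SelectiveDeAccent` and the `keep` parameter are hypothetical names introduced here, and `Set.of` assumes Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Set;
import java.util.regex.Pattern;

final class SelectiveDeAccent {
    // Matches the combining marks that NFD splits off a base letter.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // "keep" (hypothetical parameter) lists the characters that must survive unchanged.
    static String deAccent(String str, Set<Character> keep) {
        // Compose first, so "u" followed by U+0308 is treated like the single char "ü".
        String composed = Normalizer.normalize(str, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); ) {
            int cp = composed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isBmpCodePoint(cp) && keep.contains((char) cp)) {
                // Excluded character: copy it through untouched.
                sb.append((char) cp);
            } else {
                // Everything else: decompose this one code point and drop its marks.
                String decomposed = Normalizer.normalize(
                        new String(Character.toChars(cp)), Normalizer.Form.NFD);
                sb.append(MARKS.matcher(decomposed).replaceAll(""));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(deAccent("Düsseldorf café", Set.of('ü'))); // Düsseldorf cafe
    }
}
```

Normalizing one code point at a time is slower than normalizing the whole string, so this sketch trades speed for simplicity.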


Solution

  • Do not use normalization to remove accents!

    For example, your method leaves the following letters unchanged, because Unicode defines no decomposition for them into a base letter plus a combining mark:

    • ł

    • đ

    • ħ

    You may also want to split ligatures like œ into separate letters (i.e. oe).
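The limitation is easy to reproduce. The sketch below applies the same NFD-and-strip approach as the question; the class name `NfdLimits` and the method name `stripMarks` are made up for this illustration.

```java
import java.text.Normalizer;

final class NfdLimits {
    // Same technique as the question's deAccent: decompose, then drop combining marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("é"));    // e
        System.out.println(stripMarks("Łódź")); // Łodz  (ó and ź are stripped, Ł is not)
        System.out.println(stripMarks("łđħœ")); // łđħœ  (nothing to decompose)
    }
}
```

The stroke in ł, đ and ħ is part of the letter itself rather than a combining mark, so NFD has nothing to split off.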

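    Try a lookup table instead. The table below covers U+00C0 through U+017F (Latin-1 Supplement and Latin Extended-A), with one replacement character per code point: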

    private static final String TAB_00C0 = "" +
            "AAAAAAACEEEEIIII" +
            "DNOOOOO×OUUUÜYTs" + // <-- Ü is kept on purpose, as is the multiplication sign;
                                 //     the final 's' (for ß) is never reached, see the switch
            "aaaaaaaceeeeiiii" +
            "dnooooo÷ouuuüyty" + // <-- ü is kept on purpose, as is the division sign
            "AaAaAaCcCcCcCcDd" +
            "DdEeEeEeEeEeGgGg" +
            "GgGgHhHhIiIiIiIi" +
            "IiJjJjKkkLlLlLlL" +
            "lLlNnNnNnnNnOoOo" +
            "OoOoRrRrRrSsSsSs" +
            "SsTtTtTtUuUuUuUu" +
            "UuUuWwYyYZzZzZzs";
    
    public static String toPlain(String source) {
        StringBuilder sb = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (c) {
                case 'ß':
                    sb.append("ss");
                    break;
                case 'Œ':
                    sb.append("OE");
                    break;
                case 'œ':
                    sb.append("oe");
                    break;
                // Add more ligatures, or other letters that need a non-standard
                // conversion, as further cases here.
                // Candidates worth a look: æ þ ð fl fi
                default:
                    // Letters in U+00C0..U+017F go through the table;
                    // everything else is appended unchanged.
                    if (c >= 0xc0 && c <= 0x17f) {
                        c = TAB_00C0.charAt(c - 0xc0);
                    }
                    sb.append(c);
            }
        }
        return sb.toString();
    }
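For the extra ligatures mentioned in the code comment, a map of multi-letter expansions may be easier to maintain than a growing switch. This is only a sketch: the class name `LigatureExpander` is hypothetical, and the chosen transliterations (for example þ to "th") are assumptions, since conventions differ between languages. `Map.of` assumes Java 9 or later.

```java
import java.util.Map;

final class LigatureExpander {
    // Assumed transliterations; adjust to the convention your data needs.
    private static final Map<Character, String> EXPANSIONS = Map.of(
            'Æ', "AE",
            'æ', "ae",
            'Þ', "Th",
            'þ', "th",
            '\uFB01', "fi", // single-character "fi" ligature
            '\uFB02', "fl"  // single-character "fl" ligature
    );

    static String expand(String source) {
        StringBuilder sb = new StringBuilder(source.length() + 8);
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            String replacement = EXPANSIONS.get(c);
            if (replacement != null) {
                sb.append(replacement); // one character becomes several letters
            } else {
                sb.append(c);           // everything else passes through
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("Æsir þing")); // AEsir thing
    }
}
```

Running such a step before `toPlain` matters because the table above would otherwise map Æ to a single A and þ to a single t.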