java, unicode, normalize

How can I map Unicode symbols to a simpler Latin-script equivalent in Java?


I want to map graphical/symbol characters to a simpler Latin alternative in Java where possible, for example:

  • U+1E36 Latin Capital Letter L with Dot Below -> L
  • U+25B6 Black Right-Pointing Triangle -> >
  • U+25C0 Black Left-Pointing Triangle -> <
  • U+25B2 Black Up-Pointing Triangle -> ^

My problem is that I don't know what all the characters are. Although it is technically easy to map the specific characters above, it is difficult to do this for every one; there could be hundreds.

I already have this code for removing accents etc.:

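import java.text.Normalizer;
import java.util.regex.Pattern;

// Combining marks, modifier letters (Lm) and modifier symbols (Sk).
public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

private static String stripDiacritics(String str) {
    // NFD splits an accented letter into its base letter plus combining marks.
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Drop the marks, leaving only the base letters.
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}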

So I was wondering if there is something similar to help me with these symbol characters. Note that I never want to remove them, only replace them with a simpler representation.
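To make the gap concrete, here is a small self-contained check of the code above against two of the example characters. The class name StripDiacriticsDemo is only for illustration; the pattern and method are copied from the question.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Illustrative wrapper class; not part of the original code.
class StripDiacriticsDemo {
    static final Pattern DIACRITICS_AND_FRIENDS =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1E36 decomposes to 'L' + U+0323 (combining dot below),
        // so the combining mark is stripped and "L" is printed.
        System.out.println(stripDiacritics("\u1E36"));

        // U+25B6 has no decomposition and is not a modifier symbol,
        // so it comes back unchanged and this prints "true".
        System.out.println(stripDiacritics("\u25B6").equals("\u25B6"));
    }
}
```

So the accented letter is already handled, while the triangles pass through untouched, which is the part that still needs a mapping.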


Solution

  • I found this Lucene filter, which attempts to do what I'm trying to do. It looks at each char with a Unicode value of \u0080 or above and checks, via a massive switch statement, whether it has a mapping to a simpler character:

    http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-core/2.9.1/org/apache/lucene/analysis/ASCIIFoldingFilter.java

    A later version can be found by downloading the source code and looking in the org.apache.lucene.analysis.miscellaneous package.

    So a reasonable attempt has already been made, but it is rather difficult to work out which additional characters it covers beyond those already handled by the Normalizer method.
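If pulling in Lucene is not an option, one JDK-only approach is to keep the Normalizer step and add a small hand-maintained fallback table for symbols it cannot decompose. The sketch below is an assumption-laden illustration, not the Lucene implementation: the names SymbolFolder, SYMBOLS, fold and needsExtraMapping are hypothetical, the table holds only the three triangles from the question, and Map.of requires Java 9 or later. The helper needsExtraMapping reports whether the Normalizer step alone still leaves a non-ASCII result for a code point, which may help when working out which characters need an entry.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class; a sketch, not a complete folding table.
final class SymbolFolder {
    private static final Pattern DIACRITICS_AND_FRIENDS =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    // Hand-maintained fallback table (assumption: extend as needed).
    private static final Map<Integer, String> SYMBOLS = Map.of(
            0x25B6, ">",   // BLACK RIGHT-POINTING TRIANGLE
            0x25C0, "<",   // BLACK LEFT-POINTING TRIANGLE
            0x25B2, "^");  // BLACK UP-POINTING TRIANGLE

    private static String stripDiacritics(String str) {
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(decomposed).replaceAll("");
    }

    // Strip diacritics first, then replace mapped symbols.
    // The order matters: '^' is itself in category Sk, so replacing
    // before stripping would delete the freshly inserted '^'.
    static String fold(String input) {
        String stripped = stripDiacritics(input);
        StringBuilder out = new StringBuilder(stripped.length());
        stripped.codePoints().forEach(cp -> {
            String replacement = SYMBOLS.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp); // unmapped characters are kept, never removed
            }
        });
        return out.toString();
    }

    // True if the Normalizer step alone leaves something outside ASCII,
    // i.e. the code point is a candidate for the fallback table.
    static boolean needsExtraMapping(int codePoint) {
        String stripped = stripDiacritics(new String(Character.toChars(codePoint)));
        return stripped.chars().anyMatch(c -> c > 0x7F);
    }

    public static void main(String[] args) {
        // Prints "L><^"
        System.out.println(fold("\u1E36\u25B6\u25C0\u25B2"));
    }
}
```

Looping needsExtraMapping over a range of code points gives a list of characters the Normalizer does not reduce to ASCII; comparing that list against the characters handled by the Lucene filter would show what the filter adds.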