I want to map graphical/symbol characters to a simpler Java alternative where possible, for example:
My problem is that I don't know what all the characters are. Mapping the specific characters above is technically easy enough, but doing it for every one is difficult, as there could be hundreds.
I already have this code for removing accents etc.:
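    import java.text.Normalizer;
    import java.util.regex.Pattern;

    // Matches combining diacritical marks, modifier letters (Lm) and modifier symbols (Sk).
    public static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        // Decompose first so that each accent becomes a separate combining character.
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }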
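To show the gap I am describing, here is a small self-contained sketch of the code above in use. The class name `StripDiacriticsDemo` and the two sample inputs are only for illustration, and it assumes nothing beyond the JDK.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class StripDiacriticsDemo {
    static final Pattern DIACRITICS_AND_FRIENDS =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e: the accent is stripped.
        System.out.println(stripDiacritics("caf\u00e9")); // prints cafe
        // The trade mark sign has no canonical decomposition and is not a
        // modifier letter or symbol, so it comes back unchanged.
        System.out.println(stripDiacritics("\u2122").equals("\u2122")); // prints true
    }
}
```

So accents are handled, but symbol characters pass straight through.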
So I was wondering whether there is something similar to help me with these symbol characters. Note that I never want to remove them, only replace them with a simpler representation.
I found this Lucene filter, which attempts to do what I'm trying to do: it looks at each char with a Unicode value greater than \u0080 and checks whether it has a mapping to a simpler character via a massive case statement. A later version can be found by downloading the source code and looking in the org.apache.lucene.analysis.miscellaneous package.
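As a rough sketch of that lookup approach (this is not the Lucene code; the class `SymbolFolder`, its method names, and the handful of mappings are my own illustrative choices, and it assumes Java 8 or later), the idea is something like:

```java
class SymbolFolder {
    // Returns a simpler replacement for a code point, or null if none is known.
    // A real table would have hundreds of entries; these are only examples.
    static String simpler(int cp) {
        switch (cp) {
            case 0x00A9: return "(c)";              // copyright sign
            case 0x00AE: return "(r)";              // registered sign
            case 0x2122: return "TM";               // trade mark sign
            case 0x2026: return "...";              // horizontal ellipsis
            case 0x2018: case 0x2019: return "'";   // curly single quotes
            case 0x201C: case 0x201D: return "\"";  // curly double quotes
            default: return null;
        }
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            // ASCII passes straight through; only non-ASCII is looked up.
            String replacement = cp > 0x7F ? simpler(cp) : null;
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp); // unknown characters are kept, never removed
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Acme\u2122 \u00A9 2020\u2026")); // prints AcmeTM (c) 2020...
    }
}
```

The hard part is not this code but knowing which entries belong in the table.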
So a reasonable attempt has already been made, but it is rather difficult to work out which additional chars it covers that are not covered by the Normalizer method.
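One way I could at least see the Normalizer side of that comparison is to enumerate a range of code points and list the ones my method does not reduce to ASCII. This is only a sketch: the class `NormalizerCoverage`, the helper names, and the range in `main` are hypothetical, and it treats a character that gets stripped to the empty string as "covered". It assumes Java 8 or later.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NormalizerCoverage {
    static final Pattern DIACRITICS_AND_FRIENDS =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    // True when every remaining char is plain ASCII (also true for "").
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    // Code points in [from, to] that the Normalizer method leaves non-ASCII.
    static List<Integer> notCovered(int from, int to) {
        List<Integer> result = new ArrayList<>();
        for (int cp = from; cp <= to; cp++) {
            // Skip unassigned code points and lone surrogates.
            if (!Character.isDefined(cp) || Character.getType(cp) == Character.SURROGATE) {
                continue;
            }
            String simplified = stripDiacritics(new String(Character.toChars(cp)));
            if (!isAscii(simplified)) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // General Punctuation through Letterlike Symbols, as an example range.
        for (int cp : notCovered(0x2000, 0x214F)) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

The resulting list could then be checked by hand against the case statement in the Lucene filter, but that still leaves me building the mapping myself.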