Java string: convert a Unicode code point to a character


Ok, so I feel like this question has been asked many times, but I am not able to find an answer. I am comparing two files that were generated by two different programs. Both programs generate the files from the same db queries. I am running into the following differences:

s1 = Samsung - Mobile USB Chargers

vs.

s2 = Samsung \u2013 Mobile USB Chargers

How do I convert s2 to s1 or, even better, how do I compare the two without getting a difference? Someone somewhere on the wide wide internets mentioned using Apache Commons Lang's StringUtils class, but I couldn't find anything useful there.


Solution

  • You could fold all the characters with the Dash_Punctuation property to a plain ASCII hyphen before comparing.

    This code will print true:

    boolean equal = "Samsung \u2013 Mobile USB Chargers"
                        .replaceAll("\\p{Pd}", "-")
                        .equals("Samsung - Mobile USB Chargers");
    System.out.println(equal);
    

    Note that this will apply to all characters with that property (like 〰 U+3030 WAVY DASH). A comprehensive list of characters with the Dash_Punctuation (Pd) property is in UnicodeData.txt. Java 6 supports Unicode 4. See chapter 6 of the Unicode Standard for a discussion of punctuation.
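
If the comparison is needed in more than one place, the same folding can be wrapped in a small helper. The following is only a sketch: the class name DashFolding and the methods foldDashes and equalsIgnoringDashStyle are made up for illustration and do not come from any library. It assumes s2 really contains the EN DASH character (U+2013) and not the six literal characters \u2013.

```java
import java.util.regex.Pattern;

public class DashFolding {
    // Matches any character whose general category is Dash_Punctuation (Pd).
    // Compiled once so repeated comparisons do not recompile the regex.
    private static final Pattern DASHES = Pattern.compile("\\p{Pd}");

    // Hypothetical helper: replaces every Pd character with ASCII '-'.
    static String foldDashes(String s) {
        return DASHES.matcher(s).replaceAll("-");
    }

    // Hypothetical helper: compares two strings after folding dashes in both.
    static boolean equalsIgnoringDashStyle(String a, String b) {
        return foldDashes(a).equals(foldDashes(b));
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";

        System.out.println(s1.equals(s2));                   // prints false
        System.out.println(equalsIgnoringDashStyle(s1, s2)); // prints true
    }
}
```

Folding both sides means it does not matter which file contains the typographic dash. Characters that look like dashes but are not in Pd, such as U+2212 MINUS SIGN (category Sm), are left untouched by this approach.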