Ok, so I feel like this question has been asked many times, but I am not able to find an answer. I am comparing two different files that were generated by two different programs. Of course, both programs generate the files from the same db queries. I am running into the following differences:
s1 =
Samsung - Mobile USB Chargers
vs.
s2 =
Samsung \u2013 Mobile USB Chargers
How do I convert s2 to s1 or, even better, how do I compare the two without getting a difference? Someone somewhere on the wide wide internets suggested using Apache Commons Lang's StringUtils class, but I couldn't find anything useful in it.
You could fold all the characters with the Dash_Punctuation property to a plain hyphen-minus (U+002D) before comparing.
This code will print true:
boolean equal = "Samsung \u2013 Mobile USB Chargers"
    .replaceAll("\\p{Pd}", "-")
    .equals("Samsung - Mobile USB Chargers");
System.out.println(equal);
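To compare the two strings directly, as the question asks, the same folding could be wrapped in a small helper and applied to both sides. This is only a sketch: the class and method names (DashFolding, foldDashes, equalsIgnoringDashStyle) are made up for illustration and do not come from any library. It assumes that dashes are the only difference you want to ignore.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or Apache Commons Lang.
final class DashFolding {
    // Matches any single character in the Unicode Dash_Punctuation (Pd) category.
    private static final Pattern DASHES = Pattern.compile("\\p{Pd}");

    // Replaces every Pd character with U+002D HYPHEN-MINUS.
    static String foldDashes(String s) {
        return DASHES.matcher(s).replaceAll("-");
    }

    // Compares two strings after folding the dashes on both sides.
    static boolean equalsIgnoringDashStyle(String a, String b) {
        return foldDashes(a).equals(foldDashes(b));
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        System.out.println(s1.equals(s2));                   // false
        System.out.println(equalsIgnoringDashStyle(s1, s2)); // true
    }
}
```

Precompiling the pattern avoids recompiling the regex on every call, which String.replaceAll would otherwise do. That can matter when comparing files line by line.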
Note that this will apply to all characters with that property (like 〰 U+3030 WAVY DASH). A comprehensive list of characters with the Dash_Punctuation (Pd) property is in UnicodeData.txt. Java 6 supports Unicode 4. See chapter 6 of the Unicode Standard for a discussion of punctuation.
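If you want to check whether a particular character would be affected, you can ask the JDK for its general category instead of reading UnicodeData.txt. A rough sketch follows; the class name DashCategoryCheck and the method isDash are hypothetical, and the result depends on the Unicode version your Java runtime supports.

```java
// Hypothetical helper for inspecting which code points are Dash_Punctuation.
final class DashCategoryCheck {
    // True if the runtime's Unicode data puts the code point in category Pd.
    static boolean isDash(int codePoint) {
        return Character.getType(codePoint) == Character.DASH_PUNCTUATION;
    }

    public static void main(String[] args) {
        // Hyphen-minus, en dash, em dash, wavy dash, and minus sign.
        int[] samples = {0x002D, 0x2013, 0x2014, 0x3030, 0x2212};
        for (int cp : samples) {
            // Character.getName requires Java 7 or later.
            System.out.printf("U+%04X %s -> %b%n", cp, Character.getName(cp), isDash(cp));
        }
    }
}
```

U+2212 MINUS SIGN is a math symbol (Sm), not Pd, so the regex above leaves it alone. If your data contains it, you would need to handle it separately.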