java, utf-8, ascii, string-comparison

UTF-8 to ASCII for comparison in Java


I have a list of strings that I want to compare with "singleArgument". The comparison should not be case-sensitive, so I wrote a method that lower-cases both sides. I also don't want special characters to break the comparison: if I'm looking for "ščž", singleArgument can be "scz".

case noCaseSensitive:
  final String patternSourceILike = (String) singleArgument;
  verdict = buildPattern(patternSourceILike.toLowerCase(Locale.ROOT))
    .matcher(((String) resolvedValue).toLowerCase(Locale.ROOT))
    .matches();
  break;

This is what I have for the case-insensitive comparison.

If I convert the string from UTF-8 to ASCII and then compare, the special characters turn into unknown characters.
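The symptom can be reproduced with a minimal sketch like the one below. It assumes the conversion goes through String.getBytes with the US-ASCII charset, which is a guess, since the post does not show the conversion code. The class and method names (AsciiRoundTrip, toAscii) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class AsciiRoundTrip {
    // Encodes to US-ASCII and decodes again. Characters that ASCII cannot
    // represent are replaced by the charset's replacement byte, '?'.
    public static String toAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String input = "\u0161\u010D\u017E"; // ščž
        System.out.println(toAscii(input)); // prints "???"
    }
}
```

A charset conversion only maps characters that exist in the target charset. It never picks the "closest" letter, so the accented letters are lost rather than reduced to s, c and z.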


Solution

  • No idea why you'd want to do this, since removing diacritics from letters makes them completely different letters, but you can use java.text.Normalizer for it: normalize the text to its canonical decomposition (NFD), which splits each accented letter into a base letter followed by separate combining marks, then strip out those (now separate) diacritics with a regex replace.

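    import java.text.Normalizer;

    public class Test {
        public static void main(String[] args) {
            String input = "\u0161\u010D\u017E"; // ščž
            // NFD splits each letter into a base letter plus a combining mark (here U+030C, caron).
            String canonical = Normalizer.normalize(input, Normalizer.Form.NFD);
            // \p{M} matches only combining marks, so spaces, digits and punctuation are kept.
            String ascii = canonical.replaceAll("\\p{M}+", "");
            String output = String.format("%s, %s", input, ascii);
            System.out.println(output); // "ščž, scz"
        }
    }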
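To tie this back to the case-insensitive comparison in the question, the normalization step can be wrapped in a small helper that is applied to both sides before comparing. This is only a sketch: AccentFolding, fold and matchesFolded are hypothetical names, and it uses plain equals in place of the question's buildPattern(...), whose implementation is not shown.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

public class AccentFolding {
    // Matches Unicode combining marks (general category M), e.g. U+030C COMBINING CARON.
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

    // Hypothetical helper: decomposes, strips combining marks, then lower-cases.
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = COMBINING_MARKS.matcher(decomposed).replaceAll("");
        return stripped.toLowerCase(Locale.ROOT);
    }

    // Compares two strings ignoring both case and diacritics.
    public static boolean matchesFolded(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        // Upper-case S with caron, then lower-case c and z with caron.
        System.out.println(matchesFolded("\u0160\u010D\u017E", "scz")); // true
        System.out.println(matchesFolded("abc", "abd"));                // false
    }
}
```

In the question's switch case, the same idea would mean passing fold(patternSourceILike) and fold((String) resolvedValue) where toLowerCase(Locale.ROOT) is called now. Note that this only handles letters that have a canonical decomposition. A letter such as "ł" has none, so it passes through unchanged and would need an explicit mapping.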