java, regex, string, char, special-characters

Java String Filter out unwanted characters


I have a string like this:

−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou "výborné" <značky>?!.

And I want to end up with this:

häagen-dazs a le & co jsou výborné značky.

Unlike in "How to filter string for unwanted characters using regex?", I want to keep the accents (diacritics) in the string.

I use the following replaceAll:

str.replaceAll("[¨%=;\\:\\(\\)\\$\\[\\]\\{\\}\\<\\>\\+\\*\\−\\@\\#\\~\\?\\!\\^\\'\\\"\\|\\/]", "");
  • Is this the correct approach?
  • Is there a simpler way to keep only alphanumeric characters (including accented ones), spaces, and the & . - symbols?

Solution

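  • You can loop over the characters of the input String and keep each one only if it matches a regex of allowed characters. Use the character class [a-zA-Z& \\-_\\.ýčéèêàâùû] to test each character individually. The class must list every accented letter you want to keep; a letter that is not listed, such as ä, is removed.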

    This is the code you need:

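        String input = "−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou výborné <značky>?!";
        // Collects only the characters that match the allowed class.
        StringBuffer sb = new StringBuffer();
        for (char c : input.toCharArray()) {
            // Lowercase first so that uppercase accented letters match the lowercase ones listed.
            if ((Character.toString(c).toLowerCase()).matches("[a-zA-Z& \\-_\\.ýčéèêàâùû]")) {
                sb.append(c);
            }
        }
        System.out.println(sb.toString());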
    

    Demo:

    Running this code on the input above gives the following output. The leading - is kept because the hyphen is in the allowed class, and ä is dropped because it is not:

    -hagen-dazs a le & co jsou výborné značky
    

    Note:

    • It uses input.toCharArray() to get an array of chars and loop over it.
    • It uses (Character.toString(c).toLowerCase()).matches("[a-zA-Z& \\-_\\.ýčéèêàâùû]") to test whether the current char matches the regex of allowed characters.
    • It uses a StringBuffer to build the new String from only the allowed characters. A StringBuilder would work equally well here, since no synchronization is needed.
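
As a possible alternative to listing accented letters one by one, the sketch below uses the Unicode property classes \p{L} (any letter) and \p{N} (any digit) in a negated character class, so a single replaceAll removes everything except letters, digits, spaces, and & . -. The class name AccentFilter and the method name keepAllowed are made up for this illustration. The second replaceAll, which strips leading non-alphanumeric characters, is an assumption based on the desired output in the question, which has no leading hyphen.

```java
public class AccentFilter {

    // Hypothetical helper: keeps Unicode letters, digits, spaces and & . -
    static String keepAllowed(String input) {
        return input
                // Remove every character that is NOT a letter, digit, space, &, . or -
                .replaceAll("[^\\p{L}\\p{N} &.\\-]", "")
                // Assumption: also drop anything before the first letter or digit
                .replaceAll("^[^\\p{L}\\p{N}]+", "");
    }

    public static void main(String[] args) {
        String input = "−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou \"výborné\" <značky>?!.";
        // prints: häagen-dazs a le & co jsou výborné značky.
        System.out.println(keepAllowed(input));
    }
}
```

This assumes the accented letters are stored as single precomposed characters. If the input may contain decomposed forms (a base letter followed by a combining mark), adding \p{M} to the first class would keep those marks as well.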