java regex string char special-characters

Java String Filter out unwanted characters

I have a string like this:

−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou "výborné" <značky>?!.

And I want to end up with this:

häagen-dazs a le & co jsou výborné značky.

Unlike in "How to filter string for unwanted characters using regex?", I want to keep accented characters (diacritics) in the string.

I use the following replaceAll:

str.replaceAll("[¨%=;\\:\\(\\)\\$\\[\\]\\{\\}\\<\\>\\+\\*\\−\\@\\#\\~\\?\\!\\^\\'\\\"\\|\\/]", "");

Is this the correct approach?
Is there a simpler way to keep only alphanumeric characters (including accented ones), spaces, and the symbols & . -?

Solution

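You can loop over the characters of the input String and keep each one only if it matches a regex of allowed characters. Use the regex [a-zA-Z& \\-_\\.ýčéèêàâùû] to test each character individually, and extend the character class with any other accented letters you need (ä, for example, is not in it).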

This is the code you need:

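    String input = "−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou výborné <značky>?!";
    // Whitelist of characters to keep; anything else (including ä, which is not listed) is dropped.
    String allowed = "[a-zA-Z& \\-_\\.ýčéèêàâùû]";
    StringBuffer sb = new StringBuffer();
    for (char c : input.toCharArray()) {
        // Lower-case the character first so upper-case accented letters also match.
        if (Character.toString(c).toLowerCase().matches(allowed)) {
            sb.append(c);
        }
    }
    System.out.println(sb.toString());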

Demo:

Here's a working demo that uses this code and gives the following output:

-hagen-dazs a le & co jsou výborné značky

The leading - is kept because the hyphen is in the allowed set, and the ä is dropped because it is not.

Note:

It uses input.toCharArray() to get an array of chars and loops over it.
It uses (Character.toString(c).toLowerCase()).matches("[a-zA-Z& \\-_\\.ýčéèêàâùû]") to test whether the current char is one of the allowed characters.
It uses a StringBuffer to build a new String containing only the allowed characters (a StringBuilder would work equally well in single-threaded code).
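
As a possible simpler alternative to the loop, here is a minimal sketch that does the filtering in a single replaceAll with a negated character class. It assumes that "alphanumeric with accents" can be expressed with the Unicode property classes \p{L} (letters), \p{M} (combining marks) and \p{N} (digits), so accented letters do not have to be listed one by one. The class name CharFilter and the method name keepAllowed are hypothetical, chosen only for this illustration.

```java
public class CharFilter {

    // Assumed whitelist: any Unicode letter (\p{L}), combining mark (\p{M}),
    // digit (\p{N}), plus space, '&', '.' and '-'.
    // The leading ^ negates the class, so everything NOT listed is matched.
    private static final String UNWANTED = "[^\\p{L}\\p{M}\\p{N} &.\\-]";

    // Hypothetical helper: removes every character outside the whitelist.
    static String keepAllowed(String input) {
        return input.replaceAll(UNWANTED, "");
    }

    public static void main(String[] args) {
        String input = "−+-~*/@$^#¨%={}[häagen-dazs;:] a (le & co') jsou \"výborné\" <značky>?!.";
        System.out.println(keepAllowed(input));
    }
}
```

For the string from the question this should print -häagen-dazs a le & co jsou výborné značky. with the ä preserved. The leading hyphen (from the +-~ run) still survives, because - is an allowed character; removing it would need an extra rule, such as trimming leading punctuation, which the question does not specify.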