Tags: java, regex, replace, utf-8

Replace a non-UTF-8 character in an XML String with a space in Java


Following this post, I'm able to replace the ’ character (a typographic apostrophe) in my XML String with a space:

String s = "<content>abc’s house.</content>";     
s = s.replaceAll("[^\\x00-\\x7F]"," ");
System.out.println(s);

The issue is that it produces three spaces (abc   s house.), I'd guess because ’ is being treated as three characters. I need that character to be replaced by a single space: abc s house.
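The "three characters" guess can be checked with a small sketch. It assumes the apostrophe is U+2019 and that the text was decoded with a single-byte charset somewhere along the way. ISO-8859-1 is used here only as a stand-in, since the charset actually involved is not known. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ApostropheBytes {
    // Simulates an encoding mismatch: encode as UTF-8, then decode the bytes
    // with a single-byte charset (ISO-8859-1 is an assumption for illustration).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String apostrophe = "\u2019"; // the ’ character, written as an escape

        // One char in a correctly decoded String, but three bytes in UTF-8.
        System.out.println(apostrophe.length());                                // 1
        System.out.println(apostrophe.getBytes(StandardCharsets.UTF_8).length); // 3

        // After a wrong decode, each byte becomes its own non-ASCII char,
        // so the regex without a quantifier replaces it with three spaces.
        String broken = misdecode("abc" + apostrophe + "s house.");
        System.out.println(broken.replaceAll("[^\\x00-\\x7F]", " "));
    }
}
```

If the String really held a single ’ char, the original regex would already produce one space, so seeing three suggests the text was mis-decoded before the replacement ran.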

If I use the approach below, it works when running from Eclipse, but when I build an executable jar, the ’ literal gets converted into a different character sequence, and since the string doesn't contain that sequence, the replacement does nothing. (I can see this by decompiling the jar and inspecting the code.)

s = s.replace("’", " ");

Solution

  • You can use the + quantifier, so that a run of consecutive non-ASCII characters is replaced by a single space:

        String s = "abc’s house.";
        s = s.replaceAll("[^\\x00-\\x7F]+"," ");
        System.out.println(s);
    

    prints:

    abc s house.
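For the literal replace that stops working in the jar, one possible workaround is sketched below. It assumes the cause is the compiler reading the source file with a different encoding than Eclipse does, which the question does not confirm. Writing the character as the Unicode escape \u2019 keeps the source file pure ASCII, so the compiled literal no longer depends on the source encoding. The class and method names are hypothetical.

```java
class ReplaceApostrophe {
    // Replaces the typographic apostrophe (U+2019) with a single space.
    // The escape is resolved by the compiler regardless of file encoding.
    static String replaceApostrophe(String xml) {
        return xml.replace("\u2019", " ");
    }

    public static void main(String[] args) {
        String s = "<content>abc\u2019s house.</content>";
        System.out.println(replaceApostrophe(s)); // <content>abc s house.</content>
    }
}
```

Under the same assumption, another option is to tell the compiler the source encoding explicitly, for example with javac's -encoding UTF-8 option, so the ’ literal is read the same way in every build.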