java, replace, replaceall

Is there a regex for the String.replaceAll method that keeps only letters and whitespace?


I have made a program that counts the frequency of each word in a very long string. My problem is that the program counts, for example, "*it" (where * stands for a quotation mark) and "it" as different words and therefore puts them in different categories.

I tried to replace all the punctuation marks I know of with the following code:

text = text.replace("\n", " ");
text = text.replaceAll("\\p{Punct}", " ");
text = text.replace("\"", "");
text = text.replace("–", "");
text = text.replace("\t", "");

Unfortunately, the code didn't work, and I think that is because Unicode has many different quotation marks that I can't tell apart. Is there a way to remove all Unicode characters except letters and whitespace with the String.replaceAll method, or do I have to convert the string to a char array and continue from there?

Thanks a lot, any help would be appreciated.


Solution

  • I think this might do it

    text = text.replaceAll("[^a-zA-Z0-9 ]", "");
    

    This removes every character that is not an ASCII letter, a digit, or a space. Note that it also strips non-ASCII letters (such as accented ones) as well as tabs and newlines.

    EDIT:

    As suggested by @npinti, \p{L} matches any Unicode letter, so this version also keeps non-ASCII letters:

    text = text.replaceAll("[^\\p{L}0-9 ]", "");
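The question asks to keep only letters and whitespace, so a variation on the pattern above could drop the digits and use \s instead of a literal space. The following is a minimal sketch, not the answerer's code: the class and method names (WordFrequency, keepLettersAndWhitespace, countWords) are made up for illustration, and it assumes that replacing punctuation with a space, rather than deleting it, is the desired behavior so that adjacent words are not glued together.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordFrequency {

    // Hypothetical helper: replaces every character that is neither a
    // Unicode letter (\p{L}) nor whitespace (\s) with a single space.
    static String keepLettersAndWhitespace(String text) {
        return text.replaceAll("[^\\p{L}\\s]", " ");
    }

    // Hypothetical helper: counts how often each word occurs after cleaning.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String cleaned = keepLettersAndWhitespace(text).trim();
        if (cleaned.isEmpty()) {
            return counts;
        }
        // Split on runs of whitespace so repeated spaces produce no empty words.
        for (String word : cleaned.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Curly quotes (U+201C, U+201D) and an en dash (U+2013) around the words.
        String text = "\u201Cit\u201D was it \u2013 it!";
        System.out.println(countWords(text)); // prints {it=3, was=1}
    }
}
```

One consequence of this choice is that apostrophes are also replaced, so a word like "wasn't" would be split into "wasn" and "t"; whether that is acceptable depends on the text being counted.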