I have written a program that counts how often each word occurs in a very long string. My problem is that the program counts, for example, "*it" (read * as a quotation mark) and "it" as different words and therefore puts them in different categories.
I tried to replace all the punctuation marks I know of with the following code:
text = text.replace("\n", " ");
text = text.replaceAll("\\p{Punct}", " ");
text = text.replace("\"", "");
text = text.replace("–", "");
text = text.replace("\t", "");
Unfortunately, the code didn't work, and I think that is because Unicode has many different quotation marks that I can't tell apart by eye. Is there a way to remove all Unicode characters except letters and whitespace with the String.replaceAll method, or do I have to convert the string to a char array and continue from there?
Thanks a lot, any help would be appreciated.
I think this might do it:
text = text.replaceAll("[^a-zA-Z0-9 ]", "");
which will remove every character that is not an ASCII letter, a digit, or a space.
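To make the effect of that character class concrete, here is a small sketch. The class and method names (AsciiFilterDemo, keepAsciiAlnum) are made up for illustration. It shows that typographic quotes are removed, but so is any letter outside a-z and A-Z:

```java
public class AsciiFilterDemo {
    // Hypothetical helper: delete everything except ASCII letters, digits, and spaces.
    public static String keepAsciiAlnum(String text) {
        return text.replaceAll("[^a-zA-Z0-9 ]", "");
    }

    public static void main(String[] args) {
        // Curly quotes (U+201C, U+201D) and the straight quote are all removed.
        System.out.println(keepAsciiAlnum("\u201Cit\u201D and \"it\"")); // it and it
        // Accented letters are not in a-z, so they are removed as well.
        System.out.println(keepAsciiAlnum("caf\u00E9 na\u00EFve"));      // caf nave
    }
}
```

So this is fine for plain English text, but it silently damages words that contain non-ASCII letters.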
EDIT:
As suggested by @npinti, use \p{L} so that letters from any script are kept:
text = text.replaceAll("[^\\p{L}0-9 ]", "");
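For context, \p{Punct} in Java matches only the ASCII punctuation characters by default, which is probably why the curly quotes survived the original attempt. Below is a minimal sketch of how the Unicode-aware pattern could be plugged into a word counter. The names WordFrequency, stripNonWordChars, and countWords are hypothetical, and two choices are my own assumptions rather than part of the question: unwanted characters are replaced with a space instead of being deleted (so that words joined only by a dash are not glued together), and words are lowercased before counting.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordFrequency {
    // Replace every character that is not a Unicode letter (\p{L}),
    // a decimal digit (\p{Nd}), or whitespace (\s) with a single space.
    public static String stripNonWordChars(String text) {
        return text.replaceAll("[^\\p{L}\\p{Nd}\\s]", " ");
    }

    // Count case-insensitive word occurrences; TreeMap keeps the keys sorted.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        String cleaned = stripNonWordChars(text).trim();
        if (cleaned.isEmpty()) {
            return counts; // avoid counting the empty string as a word
        }
        for (String word : cleaned.split("\\s+")) {
            counts.merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Curly quotes (U+201C, U+201D), an en dash (U+2013), and a straight quote.
        String text = "\u201CIt\u201D is what it is \u2013 \"it\"!\n";
        System.out.println(countWords(text)); // {is=2, it=3, what=1}
    }
}
```

If case should be preserved, drop the toLowerCase call; if the original "delete instead of replace" behaviour is wanted, pass "" as the replacement.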