How can I remove Arabic punctuation from a String in Java?

I am working on an Arabic dictionary, and I am getting sentences like
String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'"; from my database, but I can't process the sentence without first removing the accents and punctuation.

I tried using:

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public static String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}

but it didn't work.

Solution

Why don't you just go for the Unicode punctuation / mark, non-spacing categories?

Not sure of your expected result as it's not posted - and I can't read Arabic :), but try this code:

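import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
// \p{P} matches any punctuation; \p{Mn} matches non-spacing marks (e.g. Arabic diacritics)
Pattern p = Pattern.compile("[\\p{P}\\p{Mn}]");
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println("found: " + m.group());
}
// Rewind the matcher, then replace every match with a space
m.reset();
System.out.println("Replaced: " + m.replaceAll(" "));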

Output:

found: '
found: َ
found: َ
found: َ
found: ُ
found: ً
found: :
found: َ
found: َ
found: َ
found: َ
found: َ
found: ّ
found: َ
found: َ
found: .
found: '
Replaced:  أ ب ن  ف لان ا  ع اب ه ور م اه بخ ل  ة س وء

I suppose it's not your desired final result, but I hope it's something you can work with.
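
If the goal is to delete the marks rather than space them out, a character class of `\p{P}` and `\p{Mn}` can be used with an empty replacement instead. Below is a minimal sketch of that idea; the class name `ArabicCleaner` and the method name `stripMarksAndPunctuation` are my own placeholders, not anything from the question, and I'm assuming you want every punctuation character gone, including the surrounding quotes.

```java
import java.util.regex.Pattern;

public class ArabicCleaner {
    // \p{P}  = any punctuation character
    // \p{Mn} = non-spacing marks, which covers Arabic diacritics such as fatha and shadda
    private static final Pattern MARKS_AND_PUNCT = Pattern.compile("[\\p{P}\\p{Mn}]+");

    // Hypothetical helper: removes matches instead of replacing them with spaces
    public static String stripMarksAndPunctuation(String text) {
        return MARKS_AND_PUNCT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
        System.out.println(stripMarksAndPunctuation(input));
    }
}
```

Spaces between words are left alone, since a space is neither punctuation nor a mark. Compiling the pattern once as a constant avoids recompiling it for every sentence pulled from the database.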

Also, this is a gold mine of information on the Unicode categories. I believe most are applicable in a Java Pattern.
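
To check which category or block a particular character falls into, `Character.getType` and `Character.UnicodeBlock.of` can help. The sketch below (the class name `CategoryProbe` and method `describe` are hypothetical) also hints at why the `InCombiningDiacriticalMarks` attempt in the question found nothing: that name refers to one specific Unicode block, and the Arabic fatha lives in the Arabic block even though its category is still "mark, non-spacing".

```java
public class CategoryProbe {
    // Hypothetical helper: reports the Unicode block of a code point and
    // whether its general category is Mn (non-spacing mark)
    public static String describe(int codePoint) {
        boolean nonSpacingMark = Character.getType(codePoint) == Character.NON_SPACING_MARK;
        return String.format("U+%04X block=%s Mn=%b",
                codePoint, Character.UnicodeBlock.of(codePoint), nonSpacingMark);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x064E)); // ARABIC FATHA
        System.out.println(describe(0x0301)); // COMBINING ACUTE ACCENT
    }
}
```

Both characters report `Mn=true`, but only the second one belongs to the Combining Diacritical Marks block, which is why matching on the category rather than the block picks up the Arabic marks.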