I am working on an Arabic dictionary, and I am getting sentences like
String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
from my database, but I can't process the sentence without first removing the accents (diacritics) and punctuation.
I tried using:
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;
public static String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}
but it didn't work.
Why don't you just match on the Unicode "punctuation" (\p{P}) and "mark, non-spacing" (\p{Mn}) categories?
I'm not sure what result you expect, since you didn't post it (and I can't read Arabic :)), but try this code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String input = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
Pattern p = Pattern.compile("[\\p{P}\\p{Mn}]");
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println("found: " + m.group());
}
m.reset();
System.out.println("Replaced: " + m.replaceAll(" "));
Output:
found: '
found: َ
found: َ
found: َ
found: ُ
found: ً
found: :
found: َ
found: َ
found: َ
found: َ
found: َ
found: ّ
found: َ
found: َ
found: .
found: '
Replaced: أ ب ن ف لان ا ع اب ه ور م اه بخ ل ة س وء
I suppose it's not your desired final result, but I hope it's something you can work with.
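If the goal is to delete the marks and punctuation rather than replace each one with a space, something along these lines might be closer to what you want. This is only a minimal sketch: the class name `ArabicCleaner` and the method `clean` are made up for illustration, and it assumes every punctuation character in your data is followed by whitespace, since otherwise deleting it would glue two words together. It also deliberately skips NFD normalization, because decomposing first would split a letter such as أ into a plain alef plus a combining hamza, and the hamza would then be stripped as well.

```java
import java.util.regex.Pattern;

final class ArabicCleaner {
    // \p{Mn}: non-spacing marks, the category the Arabic short-vowel marks fall into.
    // \p{P}:  punctuation of any kind (quotes, colon, full stop, ...).
    private static final Pattern MARKS_AND_PUNCTUATION =
            Pattern.compile("[\\p{Mn}\\p{P}]+");

    // Hypothetical helper: deletes the matches instead of replacing them with a space,
    // then trims any leading or trailing whitespace that is left over.
    static String clean(String text) {
        return MARKS_AND_PUNCTUATION.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // BEH + FATHA, NOON + FATHA, then a full stop, written as escapes
        // so the example does not depend on the source file encoding.
        String cleaned = clean("\u0628\u064E\u0646\u064E.");
        System.out.println(cleaned.equals("\u0628\u0646")); // true: only the two letters remain
    }
}
```

If some punctuation in your data is not followed by a space, replacing with " " and then collapsing runs of whitespace may be the safer choice.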
Also, this is a gold mine of information on the Unicode categories. I believe most are applicable in a Java Pattern.
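As a quick way to see which category and block a given character belongs to, and probably why the `InCombiningDiacriticalMarks` attempt matched nothing, here is a small sketch. The class and method names (`MarkInspector`, `isNonSpacingMark`, `blockOf`) are my own; the `Character` calls are standard JDK ones. It checks the Arabic fatha (U+064E) against the Latin combining acute accent (U+0301): both are non-spacing marks, but only the latter sits in the Combining Diacritical Marks block.

```java
final class MarkInspector {
    // True if the code point is in the general category Mn (mark, non-spacing).
    static boolean isNonSpacingMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    // The Unicode block the code point is assigned to, as the JDK sees it.
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    public static void main(String[] args) {
        int fatha = 0x064E;       // ARABIC FATHA
        int acuteAccent = 0x0301; // COMBINING ACUTE ACCENT

        // Both are category Mn, so \p{Mn} matches either one.
        System.out.println(isNonSpacingMark(fatha));
        System.out.println(isNonSpacingMark(acuteAccent));

        // They live in different blocks, so a block-based pattern
        // such as \p{InCombiningDiacriticalMarks} only sees the second.
        System.out.println(blockOf(fatha));
        System.out.println(blockOf(acuteAccent));
    }
}
```

Matching on the category (`\p{Mn}`) rather than on a block therefore covers marks from any script.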