I am trying to replace all instances of sentence terminators such as '.', '?', and '!', but I do not want to replace strings like "dr." and "mr.".
I have tried the following:
text = text.replaceAll("(?![mr|mrs|ms|dr])(\\s*[\\.\\?\\!]\\s*)", "\n");
...but that does not seem to work. Any suggestions would be appreciated.
private String convertText(String text) {
text = text.replaceAll("\\s+", " ");
text = text.replaceAll("[\n\r\\(\\)\"\\,\\:]", "");
text = text.replaceAll("(?i)(?<!dr|mr|mrs|ms|jr|sr|\\s\\w)(\\s*[\\.\\?\\!\\;](?:\\s+|$))","\r\n");
return text.trim();
}
The code will extract all* compound and single sentences from an excerpt of text, removing all punctuation and extraneous white-space.
*There are some exceptions...
You need to use negative lookbehind instead of negative lookahead like this
String x = "dr. house.";
System.out.println(x.replaceAll("(?<!mr|mrs|ms|dr)(\\s*[\\.\\?\\!]\\s*)","\n"));
Also the list of mr/dr/ms/mrs
should not be inside character classes.