I have the following regex in Java:
String regex = "[^\\s\\p{L}\\p{N}]";
Pattern p = Pattern.compile(regex);
String phrase = "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)";
String delimited = p.matcher(phrase).replaceAll("");
Right now this regex removes every character that is not whitespace, a letter, or a digit.
Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when youre having fun Cant wait until next summer
The problem is that I want to keep the single quotes inside words, such as you're, can't, etc., but remove single quotes that come at the end of a sentence or surround a word, such as 'hello'. This is what I want:
Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when you're having fun Can't wait until next summer
How can I update my current regex to be able to do this? I need to keep the \p{L} and \p{N} as it has to work for more than one language.
Thanks!
This should do what you want, or come close:
String regex = "[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))";
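The regex has three alternatives separated by |. It will match:
1. any character that is not whitespace, a letter, a digit, or a single quote;
2. a single quote at the start of the string or immediately after whitespace;
3. a single quote at the end of the string or immediately before whitespace.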
It works on the example you give. Where it might not work the way you want is if you have a word with a quote mark on one side but not the other, as in "'Tis a shame that we couldn't visit James' house". The lookbehind and lookahead only look at the character immediately before or after the quote. They don't check whether (say) a quote mark at the beginning of a word is matched by one at the end of that word, so the regex will delete the quote marks on 'Tis and James'.
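For reference, here is a minimal sketch that runs the pattern on both the phrase from the question and the edge case above. The class name QuoteCleaner and the helper clean are made up for illustration. The trim() call is also an addition that is not part of the regex: removing ":)" leaves a trailing space, and trim() drops it.

```java
import java.util.regex.Pattern;

public class QuoteCleaner {
    // Same pattern as in the answer: strip anything that is not whitespace,
    // a letter, a digit, or a single quote, plus any single quote that sits
    // at the start/end of the string or directly next to whitespace.
    private static final Pattern STRIP =
            Pattern.compile("[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))");

    // Hypothetical helper: delete every match, then trim leftover
    // leading/trailing whitespace (the trim is an assumption, see above).
    public static String clean(String phrase) {
        return STRIP.matcher(phrase).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String phrase =
                "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)";
        // Quotes inside you're and Can't survive; the ones around 'until' do not.
        System.out.println(clean(phrase));

        String edgeCase = "'Tis a shame that we couldn't visit James' house";
        // The leading quote on 'Tis and the trailing quote on James' are both
        // removed, because each one touches the string start or whitespace.
        System.out.println(clean(edgeCase));
    }
}
```

The first call prints the output asked for in the question. The second prints "Tis a shame that we couldn't visit James house", which shows the limitation. If those one-sided quotes matter for your data, a regex that only inspects neighbouring characters is probably not enough on its own.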