
How to remove exterior punctuation from a string using regular expressions


Given strings like the ones below, remove the leading and trailing punctuation from each word using a regular expression:

String a = "!?Don't.;, .:delete !the@ $actual string%";
String b = "Hyphenated-words, too!";

I know that the regex [\P{Alnum}] will target all non-alphanumeric characters, but how do I target ONLY the leading and trailing punctuation so I get...

a = "Don't delete the actual string";
b = "Hyphenated-words too";

... instead of:

a = "Dont delete the actual string";
b = "Hyphenated words too";

I just need the regular expression, not the actual code to remove the punctuation.


Solution

  • You want to match punctuation that is adjacent to a) a whitespace character OR b) the beginning or end of the string. That means:

    • your pattern preceded by the positive lookbehind (?<=^|\s), or

    • your pattern followed by the positive lookahead (?=\s|$)

    To shorten the pattern, we can reword the condition: the punctuation block must either a) not be preceded by a non-whitespace character or b) not be followed by a non-whitespace character. That gives:

    • your pattern preceded by the negative lookbehind (?<!\S), or

    • your pattern followed by the negative lookahead (?!\S)
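To see that the two formulations behave the same on the sample strings, here is a minimal sketch that applies both. The class and method names (LookaroundForms, stripPositive, stripNegative) are made up for illustration, and I have only checked the two forms against the inputs from the question, so treat this as a comparison on those inputs rather than a proof of equivalence for every possible string.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class LookaroundForms {
    // Positive form: punctuation right after start-of-input or whitespace,
    // or right before whitespace or end-of-input.
    static final Pattern POSITIVE =
            Pattern.compile("(?<=^|\\s)\\p{Punct}+|\\p{Punct}+(?=\\s|$)");

    // Negative form: punctuation that is not preceded, or not followed,
    // by a non-whitespace character.
    static final Pattern NEGATIVE =
            Pattern.compile("(?<!\\S)\\p{Punct}+|\\p{Punct}+(?!\\S)");

    static String stripPositive(String s) {
        return POSITIVE.matcher(s).replaceAll("");
    }

    static String stripNegative(String s) {
        return NEGATIVE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String a = "!?Don't.;, .:delete !the@ $actual string%";
        System.out.println(stripPositive(a)); // Don't delete the actual string
        System.out.println(stripNegative(a)); // Don't delete the actual string
    }
}
```

The apostrophe in "Don't" survives in both cases because it has a letter on each side, so neither the lookbehind nor the lookahead condition holds for it.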

    As a final note, you should use \p{Punct} instead of [\P{Alnum}] to match punctuation: \P{Alnum} matches every non-alphanumeric character, which includes whitespace. See the comment by sln for details.
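The difference between the two character classes is easiest to see by removing every match with no lookarounds at all. This is only a small sketch with a made-up class name (PunctVersusNonAlnum). It assumes Java's default regex flags, under which \p{Alnum} and \p{Punct} are the ASCII-only POSIX classes.

```java
// Hypothetical helper class, named only for this illustration.
class PunctVersusNonAlnum {
    // Removes everything that is not an ASCII letter or digit (spaces too).
    static String removeNonAlnum(String s) {
        return s.replaceAll("\\P{Alnum}", "");
    }

    // Removes ASCII punctuation only; whitespace is left alone.
    static String removePunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        String b = "Hyphenated-words, too!";
        System.out.println(removeNonAlnum(b)); // Hyphenatedwordstoo
        System.out.println(removePunct(b));    // Hyphenatedwords too
    }
}
```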

    Here is an example usage:

    String a = "!?Don't.;, .:delete !the@ $actual string%";
    String b = "Hyphenated-words, too!";
    String regex = "(?:(?<!\\S)\\p{Punct}+)|(?:\\p{Punct}+(?!\\S))";
    System.out.println(a.replaceAll(regex, ""));
    System.out.println(b.replaceAll(regex, ""));
    

    Output:

    Don't delete the actual string
    Hyphenated-words too
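If the pattern is applied many times, it may be worth compiling it once instead of calling replaceAll on the string each time. The sketch below does that and also covers one edge case not present in the question: a token made up entirely of punctuation (such as "--") is removed completely, which leaves two adjacent spaces behind. The class name ExteriorPunctuation and the stripAndCollapse helper are my own additions, and collapsing whitespace is an assumption about what you might want, not something the question asks for.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this illustration.
class ExteriorPunctuation {
    // Same pattern as in the answer, without the redundant non-capturing groups.
    private static final Pattern EXTERIOR =
            Pattern.compile("(?<!\\S)\\p{Punct}+|\\p{Punct}+(?!\\S)");

    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Strips punctuation from the outside of every whitespace-separated token.
    static String strip(String s) {
        return EXTERIOR.matcher(s).replaceAll("");
    }

    // Optional extra step (an assumption, not part of the original answer):
    // collapse runs of whitespace left behind by punctuation-only tokens.
    static String stripAndCollapse(String s) {
        return WHITESPACE_RUN.matcher(strip(s)).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String c = "wait -- what?";
        System.out.println("[" + strip(c) + "]");            // [wait  what]
        System.out.println("[" + stripAndCollapse(c) + "]"); // [wait what]
    }
}
```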