Tags: java, regex, text-parsing

English contraction catastrophes


Background

I'm writing a straight-quote to curly-quote converter and want to split the substitution into a few separate steps. The first step is to replace contractions in the text using a lexicon of known contractions. This won't resolve ambiguities, but it should convert the straight quotes used in common contractions.

Problem

In Java, \b and \w don't include apostrophes as part of a word, which makes this problem a bit finicky. The issue is in matching words that:

  • contain one or more apostrophes, but do not start or end with one (inner);
  • begin with an apostrophe, may contain one or more, but do not end with one (began);
  • end with an apostrophe, may contain one or more, but do not start with one (ended); and
  • begin and end with an apostrophe, but contain none in between (outer).
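
The apostrophe behaviour of \b and \w can be checked with a short sketch. The class name WordBoundaryDemo and the helper words are hypothetical, introduced only for illustration; the pattern itself is the plain \b\w+\b idiom.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    /** Returns every match of \b\w+\b in the input, in order. */
    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("\\b\\w+\\b").matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The apostrophe is not a word character, so it splits or is dropped.
        System.out.println(words("what's 'tis Cookin'"));
        // prints [what, s, tis, Cookin]
    }
}
```

Every apostrophe is lost, whether it leads, trails, or sits inside the word, which is why the expressions in this question are built from \p{L} and explicit lookarounds instead.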

Code

Given some nonsensical text:

'Twas---Wha'? Wouldn'tcha'? 'Twas, or 'twasn't, 'tis what's 'tween dawn 'n' dusk 'n stuff. Cookin'? 'Sams' place, 'yo''

the regexes should capture the following words:

  • inner: what's
  • began: 'Twas, 'Twas, 'twasn't, 'tis, 'tween, 'n
  • ended: Wha', Wouldn'tcha', Cookin'
  • outer: 'n', 'Sams', 'yo'

Here are non-working expressions, a mishmash of maladroit thoughts:

  • inner: \p{L}+'\p{L}*\p{L}
  • began: ((?<=[^\p{L}])|^)'\p{L}+('\p{L}|\p{L})?
  • ended: (\p{L}|\p{L}')+'(?=[^\p{L}]|$)

This one appears to work:

  • outer: ((?<=[^\p{L}])|^)'\p{L}+'(?!\p{L})
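
A small harness makes it easier to see what each attempt actually captures. This is only a sketch: the class name AttemptProbe and the helper matches are hypothetical, and it probes just the inner and outer attempts against the sample text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttemptProbe {
    // The nonsensical sample text from the question.
    static final String TEXT =
        "'Twas---Wha'? Wouldn'tcha'? 'Twas, or 'twasn't, 'tis what's "
      + "'tween dawn 'n' dusk 'n stuff. Cookin'? 'Sams' place, 'yo''";

    /** Lists every match of the given regex in the text, in order. */
    static List<String> matches(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // inner attempt: no lookarounds, so it also bites into other categories
        System.out.println(matches("\\p{L}+'\\p{L}*\\p{L}", TEXT));
        // prints [Wouldn'tcha, twasn't, what's]

        // outer attempt: matches the expected three words
        System.out.println(matches("((?<=[^\\p{L}])|^)'\\p{L}+'(?!\\p{L})", TEXT));
        // prints ['n', 'Sams', 'yo']
    }
}
```

The inner attempt matches what's, but it also returns Wouldn'tcha (part of an ended word) and twasn't (part of a began word), because nothing stops the match from starting or stopping next to an apostrophe.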

Question

What regular expressions would categorize this quartet of contractions correctly?


Solution

  • This regex should do what you want. It uses named capture groups to categorise the words with appropriate lookarounds to ensure that we match the whole words with the required outer quotes:

    (?<inner>(?<![\p{L}'])(?:\p{L}+')+\p{L}+(?![\p{L}']))|
    (?<began>(?<!\p{L})(?:'\p{L}+)+(?![\p{L}']))|
    (?<ended>(?<![\p{L}'])(?:\p{L}+')+(?!\p{L}))|
    (?<outer>(?<!\p{L})'\p{L}+'(?!\p{L}))
    

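    Group inner matches one or more runs of letters each followed by an apostrophe, (?:\p{L}+')+, and then a final run of letters, \p{L}+. Its lookarounds reject a letter or apostrophe on either side, so the match is a whole word that neither starts nor ends with an apostrophe.

    Group began matches one or more runs of an apostrophe followed by letters, (?:'\p{L}+)+. It must not be preceded by a letter, nor followed by a letter or an apostrophe.

    Group ended matches one or more runs of letters each followed by an apostrophe, (?:\p{L}+')+. It must not be preceded by a letter or an apostrophe, nor followed by a letter.

    Group outer matches a single run of letters with an apostrophe on either end, '\p{L}+'. It must not be preceded or followed by a letter.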

    Demo on regex101
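
In Java, the four alternatives above can be joined into one pattern and the matched group name read back to categorise each word. The following is a minimal sketch rather than a definitive implementation: the class name ContractionClassifier, the CATEGORIES array, and the classify helper are hypothetical names introduced here. The regex lines are concatenated without the line breaks shown above, since whitespace would otherwise be matched literally unless Pattern.COMMENTS is set.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContractionClassifier {
    static final String[] CATEGORIES = {"inner", "began", "ended", "outer"};

    // The answer's regex, one alternative per line, joined into a single pattern.
    static final Pattern CONTRACTION = Pattern.compile(
        "(?<inner>(?<![\\p{L}'])(?:\\p{L}+')+\\p{L}+(?![\\p{L}']))|"
      + "(?<began>(?<!\\p{L})(?:'\\p{L}+)+(?![\\p{L}']))|"
      + "(?<ended>(?<![\\p{L}'])(?:\\p{L}+')+(?!\\p{L}))|"
      + "(?<outer>(?<!\\p{L})'\\p{L}+'(?!\\p{L}))");

    /** Groups every match in the text by the named group that captured it. */
    static Map<String, List<String>> classify(String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String category : CATEGORIES) {
            result.put(category, new ArrayList<>());
        }
        Matcher m = CONTRACTION.matcher(text);
        while (m.find()) {
            for (String category : CATEGORIES) {
                // Exactly one named group is non-null for each match.
                if (m.group(category) != null) {
                    result.get(category).add(m.group(category));
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text =
            "'Twas---Wha'? Wouldn'tcha'? 'Twas, or 'twasn't, 'tis what's "
          + "'tween dawn 'n' dusk 'n stuff. Cookin'? 'Sams' place, 'yo''";
        classify(text).forEach((k, v) -> System.out.println(k + ": " + v));
        // inner: [what's]
        // began: ['Twas, 'Twas, 'twasn't, 'tis, 'tween, 'n]
        // ended: [Wha', Wouldn'tcha', Cookin']
        // outer: ['n', 'Sams', 'yo']
    }
}
```

On the sample text from the question this reproduces the four expected lists, including the trailing 'yo'' case, where the final apostrophe is left unmatched.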