I'm writing a straight-quote to curly-quote converter and want to separate substitution into a few different steps. The first step is to replace contractions in text using a lexicon of known contractions. This won't resolve ambiguities, but it should convert the straight quotes used in common contractions.
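In Java, \b and \w don't treat apostrophes as part of a word, which makes this problem a bit finicky. The issue is in matching four kinds of words: those that contain an apostrophe but neither begin nor end with one (inner), those that begin with an apostrophe but don't end with one (began), those that end with an apostrophe but don't begin with one (ended), and those that both begin and end with an apostrophe (outer).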
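To illustrate why \b and \w are awkward here, the following is a minimal sketch (the class and method names are hypothetical) that runs \b\w+\b over two apostrophe-bearing words. By default \w matches only [a-zA-Z_0-9], so each apostrophe acts as a word boundary and the words come apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Collects every match of the regex in the text, in order of appearance.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The apostrophes are not word characters, so they split the words.
        System.out.println(findAll("\\b\\w+\\b", "'twasn't Cookin'"));
        // prints [twasn, t, Cookin]
    }
}
```

The leading and trailing apostrophes are dropped and 'twasn't is split in two, which is why the expressions in this post use \p{L} with explicit lookarounds instead.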
Given some nonsensical text:
'Twas---Wha'? Wouldn'tcha'? 'Twas, or 'twasn't, 'tis what's 'tween dawn 'n' dusk 'n stuff. Cookin'? 'Sams' place, 'yo''
the regexes should capture the following words:
inner: what's
began: 'Twas, 'Twas, 'twasn't, 'tis, 'tween, 'n
ended: Wha', Wouldn'tcha', Cookin'
outer: 'n', 'Sams', 'yo'
Here are non-working expressions, a mishmash of maladroit thoughts:
\p{L}+'\p{L}*\p{L}
((?<=[^\p{L}])|^)'\p{L}+('\p{L}|\p{L})?
(\p{L}|\p{L}')+'(?=[^\p{L}]|$)
This one appears to work:
((?<=[^\p{L}])|^)'\p{L}+'(?!\p{L})
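As a quick check of that last expression, here is a small sketch (the class and method names are hypothetical) that applies it to the sample text and collects the matches. It only confirms the behaviour on this one input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OuterQuoteDemo {
    // The nonsensical sample text from the question.
    static final String TEXT =
        "'Twas---Wha'? Wouldn'tcha'? 'Twas, or 'twasn't, 'tis what's "
        + "'tween dawn 'n' dusk 'n stuff. Cookin'? 'Sams' place, 'yo''";

    // A quote preceded by a non-letter (or the start of input), then letters,
    // then a closing quote that is not followed by a letter.
    static final Pattern OUTER =
        Pattern.compile("((?<=[^\\p{L}])|^)'\\p{L}+'(?!\\p{L})");

    static List<String> outerWords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = OUTER.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(outerWords(TEXT)); // prints ['n', 'Sams', 'yo']
    }
}
```

Note that 'twasn't is rejected because the closing quote of 'twasn' is followed by a letter. The prefix ((?<=[^\p{L}])|^) expresses the same condition as the negative lookbehind (?<!\p{L}).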
What regular expressions would categorize this quartet of contractions correctly?
This regex should do what you want. It uses named capture groups to categorise the words with appropriate lookarounds to ensure that we match the whole words with the required outer quotes:
(?<inner>(?<![\p{L}'])(?:\p{L}+')+\p{L}+(?![\p{L}']))|
(?<began>(?<!\p{L})(?:'\p{L}+)+(?![\p{L}']))|
(?<ended>(?<![\p{L}'])(?:\p{L}+')+(?!\p{L}))|
(?<outer>(?<!\p{L})'\p{L}+'(?!\p{L}))
Group inner looks for one or more groups of letters followed by a quote, (?:\p{L}+')+, followed by one or more letters, \p{L}+. It must be neither preceded nor followed by a letter or a quote.
Group began looks for one or more groups of a quote followed by one or more letters, (?:'\p{L}+)+. It must not be preceded by a letter, nor followed by a letter or a quote.
Group ended looks for one or more groups of letters followed by a quote, (?:\p{L}+')+. It must not be preceded by a letter or a quote, nor followed by a letter.
Group outer looks for a quote on either end with one or more letters in the middle, '\p{L}+'. It must be neither preceded nor followed by a letter.
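One way to try the combined expression in Java is sketched below. The class name, the categorize helper, and the choice of a LinkedHashMap are assumptions made for illustration. The four alternatives are concatenated into a single pattern, and each match is filed under whichever named group is non-null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContractionCategorizer {
    static final String SAMPLE =
        "'Twas---Wha'? Wouldn'tcha'? 'Twas, or 'twasn't, 'tis what's "
        + "'tween dawn 'n' dusk 'n stuff. Cookin'? 'Sams' place, 'yo''";

    static final String[] GROUPS = {"inner", "began", "ended", "outer"};

    // The four alternatives from the answer, joined into one pattern.
    static final Pattern CONTRACTIONS = Pattern.compile(
        "(?<inner>(?<![\\p{L}'])(?:\\p{L}+')+\\p{L}+(?![\\p{L}']))|"
        + "(?<began>(?<!\\p{L})(?:'\\p{L}+)+(?![\\p{L}']))|"
        + "(?<ended>(?<![\\p{L}'])(?:\\p{L}+')+(?!\\p{L}))|"
        + "(?<outer>(?<!\\p{L})'\\p{L}+'(?!\\p{L}))");

    // Maps each group name to the words it captured, in order of appearance.
    static Map<String, List<String>> categorize(String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String name : GROUPS) {
            result.put(name, new ArrayList<>());
        }
        Matcher m = CONTRACTIONS.matcher(text);
        while (m.find()) {
            for (String name : GROUPS) {
                // Only the alternative that matched has a non-null group.
                if (m.group(name) != null) {
                    result.get(name).add(m.group(name));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        categorize(SAMPLE).forEach(
            (name, words) -> System.out.println(name + ": " + words));
    }
}
```

On the sample text this prints inner: [what's], began: ['Twas, 'Twas, 'twasn't, 'tis, 'tween, 'n], ended: [Wha', Wouldn'tcha', Cookin'] and outer: ['n', 'Sams', 'yo'], one category per line.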