javascript regex word-boundary

JavaScript regular expression for word boundaries, tolerating in-word hyphens and apostrophes


I'm looking for a JavaScript regular expression that identifies word boundaries in English text. It should accept hyphens and apostrophes that appear inside words, but treat those that stand alone, or sit at the beginning or end of a word, as boundaries.

For example, for the sentence ...
  She said - 'That'll be all, Two-Fry.'
... I want the characters shown in brackets below to be detected:
  She[ ]said[ - ']That'll[ ]be[ ]all[, ]Two-Fry[.']

If I use the regex /[^A-Za-z'-]/g, then "loose" hyphens and apostrophes are not detected (detected characters in brackets):
  She[ ]said[ ]-[ ]'That'll[ ]be[ ]all[, ]Two-Fry[.]'
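
A minimal sketch for reproducing this behaviour, assuming the example sentence above (the variable names are only for illustration):

```javascript
// Example sentence from the question.
const sentence = "She said - 'That'll be all, Two-Fry.'";

// Current attempt: match any single character that is not an ASCII letter,
// an apostrophe, or a hyphen.
const attempt = /[^A-Za-z'-]/g;

// Collect every detected character.
const detected = sentence.match(attempt);
console.log(detected);

// Only spaces, the comma, and the period are detected. The loose hyphen and
// the apostrophes around the quotation are skipped, because the character
// class excludes every hyphen and apostrophe regardless of its neighbours.
console.log(detected.includes("-"), detected.includes("'")); // false false
```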

How can I alter my regex so that it detects apostrophes and hyphens that don't have a word character on both sides?

You can test my regex here: https://regex101.com/r/bR8sV1/2

Note: the text I will be working on may contain other scripts, such as русский and ไทย, so it is not feasible to simply list every character that is not part of an English word.


Solution

  • You can organize your word-boundary characters into two groups.

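    1. Characters that cannot be alone: they count as a boundary only when they sit next to another boundary character (for example the hyphen and the apostrophe).
    2. Characters that can be alone: they count as a boundary even on their own (for example whitespace and the period).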

    A regex that works with your example would be:

    [\s.,'-]{2,}|[\s.]
    

    Regex101 Demo

    All that's left is to keep adding non-word characters to those two groups until the regex fits your needs. You might start by adding symbols and more punctuation to each character class.
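
    As a rough sketch of how the pattern above could be used in JavaScript, the snippet below splits the example sentence on the detected boundaries. The `extended` pattern is a hypothetical illustration of "adding more punctuation" to both groups; it is not part of the original answer, and the exact characters to add depend on your text.

```javascript
// Pattern from the answer: a run of two or more boundary characters,
// or a single character that counts as a boundary on its own.
const boundary = /[\s.,'-]{2,}|[\s.]/g;

const sentence = "She said - 'That'll be all, Two-Fry.'";

// Splitting on the boundaries leaves the words; filter(Boolean) drops the
// empty string produced by the match at the very end of the sentence.
const words = sentence.split(boundary).filter(Boolean);
console.log(words);
// words: She, said, That'll, be, all, Two-Fry

// Hypothetical extension (an assumption, not from the answer): more
// punctuation in both groups, so that a lone comma or question mark
// also counts as a boundary.
const extended = /[\s.,;:!?"'-]{2,}|[\s.,;:!?"]/g;
const moreWords = "Wait,what?".split(extended).filter(Boolean);
console.log(moreWords);
// moreWords: Wait, what
```

    Because the pattern lists boundary characters rather than letters, words written in other scripts are left intact. Note that in the answer's pattern the comma belongs only to the first group, so a comma with letters directly on both sides is not treated as a boundary unless you also add it to the second group.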