Regex matching a written list such as "New York, Texas, and Florida"


I need a regex that handles the following cases for an arbitrarily long list in which each item can span multiple words. The list will always use the Oxford comma, if that helps.

  1. 'New York' #=> ['New York']

  2. 'New York and Texas' #=> ['New York', 'Texas']

  3. 'New York, Texas, and Florida' #=> ['New York', 'Texas', 'Florida']

I found that (.+?)(?:,|$)(?:\sand\s|$)? will match 1 and 3 but not 2.

And (.+?)(?:\sand\s|$) will match 1 and 2 but not 3.

How can I match all 3?


Solution

  • You may split the text with the following pattern:

    (?:\s*(?:\band\b|,))+\s*
    

    Details

    • (?:\s*(?:\band\b|,))+ - 1 or more occurrences of:
      • \s* - 0 or more whitespace characters
      • (?:\band\b|,) - either and as a whole word, or a comma
    • \s* - 0 or more whitespace characters.

    See the regex demo.

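    Note that you can make the pattern a bit more efficient if your regex engine supports possessive quantifiers, which stop the engine from backtracking into the matched whitespace when neither and nor a comma follows: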

    (?:\s*+(?:\band\b|,))+\s*
          ^
    

    Or atomic groups:

    (?>\s*(?:\band\b|,))+\s*
     ^^
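
For illustration, here is a minimal Ruby sketch of the splitting approach. Ruby is an assumption, inferred from the #=> notation in the question. The constant names and the split_list helper are hypothetical and not part of the original answer. The atomic-group variant is written with a plain \s* inside the group so that it does not depend on possessive quantifiers.

```ruby
# Separator pattern from the answer: one or more "and"/"," tokens,
# each optionally preceded by whitespace, then optional trailing whitespace.
SEPARATOR = /(?:\s*(?:\band\b|,))+\s*/

# Variants that limit backtracking. Ruby's regex engine accepts both syntaxes.
SEPARATOR_POSSESSIVE = /(?:\s*+(?:\band\b|,))+\s*/
SEPARATOR_ATOMIC     = /(?>\s*(?:\band\b|,))+\s*/

# Hypothetical helper: split a written list into its items.
def split_list(text, separator = SEPARATOR)
  text.split(separator)
end

EXAMPLES = [
  'New York',
  'New York and Texas',
  'New York, Texas, and Florida'
].freeze

EXAMPLES.each do |text|
  p split_list(text)
end
# ["New York"]
# ["New York", "Texas"]
# ["New York", "Texas", "Florida"]
```

Because of the \b word boundaries, an "and" inside a longer word (as in "Rhode Island") is not treated as a separator. All three patterns should return the same arrays for these inputs; the variants only change how much backtracking the engine does.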