python regex string tokenize

Split strings on commas, 'and's, 'or's


I would like to convert a list written out in natural language into a Python list.

Sample inputs:

s1 = 'make the cake, walk the dog, and pick-up poo.'
s2 = 'flour, egg-whites and sand.'

The output:

split1 = ['make the cake', 'walk the dog', 'pick-up poo']
split2 = ['flour', 'egg-whites', 'sand']

I want to split the strings on commas (and periods), 'and', and 'or', while dropping the separators and any empty strings. Because the Oxford comma is not used consistently, I cannot just split on commas.

I tried the following:

import re
[x.strip() for x in re.split('([A-Za-z -]+)', s1) if x not in ['', ',', '.']]

Which gives:

['make the cake', 'walk the dog', 'and pick-up poo']

Which is close. But for s2 it gives:

['flour', 'egg-whites and sand']

I could post-process the elements and keep splitting them on (and|or), but I would really like to tokenize on the whole set of commas, and's, and or's in one pass.

I've tried some fancier regex splits that use a negative lookahead for a word like 'and', but they still do not split on that word.

[x.strip() for x in re.split(r'([A-Za-z -]+(?!and))', s2) if x not in ['', ',', '.']]
[x.strip() for x in re.split(r'([A-Za-z -]+(?!\band\b))', s2) if x not in ['', ',', '.']]

Which also gives

['flour', 'egg-whites and sand']

I realize there are a lot of edge cases, but I feel like I'm close and just missing something small.


Solution

  • You can use

    \s*(?:\b(?:and|or)\b|[,.])\s*
    

    See the regex demo. Details:

    • \s* - zero or more whitespace characters
    • (?:\b(?:and|or)\b|[,.]) - either the whole word 'and' or 'or', or a comma or period
    • \s* - zero or more whitespace characters
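
The word boundaries in the breakdown above matter for the sample input, because 'sand' itself contains 'and'. The sketch below compares the pattern with a variant that drops \b; that variant, and the names with_boundaries, without_boundaries, good and bad, are made up here purely for illustration.

```python
import re

s2 = "flour, egg-whites and sand."

# Pattern from the answer: 'and'/'or' are only matched as whole words.
with_boundaries = re.compile(r"\s*(?:\b(?:and|or)\b|[,.])\s*")
# Hypothetical variant without \b, shown only for comparison.
without_boundaries = re.compile(r"\s*(?:and|or|[,.])\s*")

good = list(filter(None, with_boundaries.split(s2)))
bad = list(filter(None, without_boundaries.split(s2)))

print(good)  # ['flour', 'egg-whites', 'sand']
print(bad)   # ['flour', 'egg-whites', 's'] because the 'and' inside 'sand' is treated as a separator
```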

    See a Python demo:

    import re
    rx = re.compile(r"\s*(?:\b(?:and|or)\b|[,.])\s*")
    strings = ["make the cake, walk the dog, and pick-up poo.", "flour, egg-whites and sand."]
    for s in strings:
        print( list(filter(None, rx.split(s))) )
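
The filter(None, ...) call is doing real work here. When two separators sit next to each other (', ' followed by 'and '), or a separator ends the string, re.split returns empty strings. A small sketch of the unfiltered result, using the same pattern; the names raw and cleaned are only illustrative:

```python
import re

rx = re.compile(r"\s*(?:\b(?:and|or)\b|[,.])\s*")
s1 = "make the cake, walk the dog, and pick-up poo."

# Unfiltered split: the empty strings come from ', ' directly followed by
# 'and ', and from the trailing period at the end of the string.
raw = rx.split(s1)
print(raw)  # ['make the cake', 'walk the dog', '', 'pick-up poo', '']

# A list comprehension is an equivalent way to drop the empty pieces.
cleaned = [part for part in raw if part]
print(cleaned)  # ['make the cake', 'walk the dog', 'pick-up poo']
```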
    

    Note that a comma or period is often not meant as a separator when it is followed by, or enclosed between, digits (as in 1,000 or 2.5). In that case, consider replacing [,.] with [,.](?!\d) or [,.](?!(?<=\d[,.])\d).
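
To see what the digit-aware variants change, here is a rough sketch. The sample strings and the names basic, followed, enclosed and split_items are invented for this illustration and are not part of the original answer.

```python
import re

# Original pattern, plus the two digit-aware variants suggested above.
basic = re.compile(r"\s*(?:\b(?:and|or)\b|[,.])\s*")
followed = re.compile(r"\s*(?:\b(?:and|or)\b|[,.](?!\d))\s*")
enclosed = re.compile(r"\s*(?:\b(?:and|or)\b|[,.](?!(?<=\d[,.])\d))\s*")

def split_items(rx, text):
    """Split text with rx and drop the empty pieces."""
    return [part for part in rx.split(text) if part]

s = "add 1,000 grams of flour, 2.5 cups of milk and sugar."

print(split_items(basic, s))
# ['add 1', '000 grams of flour', '2', '5 cups of milk', 'sugar']
print(split_items(followed, s))
# ['add 1,000 grams of flour', '2.5 cups of milk', 'sugar']

# The two variants differ when a digit follows the comma but none precedes it.
t = "eggs,3 apples."
print(split_items(followed, t))  # ['eggs,3 apples']
print(split_items(enclosed, t))  # ['eggs', '3 apples']
```

The first variant is simpler; the second only protects punctuation that has a digit on both sides, so it still splits input with a missing space after the comma, such as 'eggs,3 apples'.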