I would like to convert a list written in natural language (as a string) into a Python list.
Sample inputs:
s1 = 'make the cake, walk the dog, and pick-up poo.'
s2 = 'flour, egg-whites and sand.'
The output:
split1 = ['make the cake', 'walk the dog', 'pick-up poo']
split2 = ['flour', 'egg-whites', 'sand']
I want to split the strings on commas (and periods), 'and', and 'or', while dropping the separators and any empty strings. Because use of the Oxford comma is not standardized, I cannot just split on commas.
I tried the following:
import re
[x.strip() for x in re.split('([A-Za-z -]+)', s1) if x not in ['', ',', '.']]
Which gives:
['make the cake', 'walk the dog', 'and pick-up poo']
Which is close. But for s2 it gives:
['flour', 'egg-whites and sand']
I can do some post-processing across elements to keep splitting them on (and|or), but I really would like to tokenize on the whole set of commas, and's, and or's.
I've tried some fancy regex splits with a negative lookahead for something like and, but it doesn't want to split on that word.
[x.strip() for x in re.split('([A-Za-z -]+(?!and))', s2) if x not in ['', ',', '.']]
[x.strip() for x in re.split(r'([A-Za-z -]+(?!\band\b))', s2) if x not in ['', ',', '.']]
Which also gives
['flour', 'egg-whites and sand']
I realize there are a lot of edge cases, but I feel like I'm close and just missing something small.
You can use
\s*(?:\b(?:and|or)\b|[,.])\s*
See the regex demo. Details:
\s* - zero or more whitespace characters
(?:\b(?:and|or)\b|[,.]) - either a whole word and or or, or a comma/period
\s* - zero or more whitespace characters
See a Python demo:
import re
rx = re.compile(r"\s*(?:\b(?:and|or)\b|[,.])\s*")
strings = ["make the cake, walk the dog, and pick-up poo.", "flour, egg-whites and sand."]
for s in strings:
    print(list(filter(None, rx.split(s))))
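To illustrate why the \b word boundaries matter, here is a small sketch comparing the pattern above with a variant that drops them. The helper name split_items and the WITHOUT_BOUNDARIES pattern are made up for this comparison and are not part of the suggested solution.

```python
import re

# Pattern from the answer: "and"/"or" only match as whole words.
WITH_BOUNDARIES = re.compile(r"\s*(?:\b(?:and|or)\b|[,.])\s*")
# Hypothetical variant for comparison only: the same pattern minus \b.
WITHOUT_BOUNDARIES = re.compile(r"\s*(?:and|or|[,.])\s*")


def split_items(pattern, text):
    """Split text on the given pattern and drop empty strings."""
    return [part for part in pattern.split(text) if part]


s2 = "flour, egg-whites and sand."
print(split_items(WITH_BOUNDARIES, s2))     # ['flour', 'egg-whites', 'sand']
print(split_items(WITHOUT_BOUNDARIES, s2))  # ['flour', 'egg-whites', 's']
```

Without the boundaries, the "and" inside "sand" is treated as a separator, which is most likely not what you want.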
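Note that a comma or period is often "excluded" as a separator when it is followed by, or enclosed in, digits (as in 1,000 or 2.5). You may consider replacing [,.] with [,.](?!\d) (do not split when a digit follows) or with [,.](?!(?<=\d[,.])\d) (do not split only when digits appear on both sides).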
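As a rough sketch of how the two digit-aware variants differ, the snippet below runs both on a made-up sample string. The names FOLLOWED_BY_DIGIT, ENCLOSED_BY_DIGITS and split_items, as well as the sample text, are illustrative assumptions.

```python
import re

# Variant 1: never split on a comma/period that is followed by a digit.
FOLLOWED_BY_DIGIT = re.compile(r"\s*(?:\b(?:and|or)\b|[,.](?!\d))\s*")
# Variant 2: skip the split only when digits sit on both sides (e.g. "1,000").
ENCLOSED_BY_DIGITS = re.compile(r"\s*(?:\b(?:and|or)\b|[,.](?!(?<=\d[,.])\d))\s*")


def split_items(pattern, text):
    """Split text on the given pattern and drop empty strings."""
    return [part for part in pattern.split(text) if part]


sample = "buy 1,000 nails,2 hammers and glue."
print(split_items(FOLLOWED_BY_DIGIT, sample))
# ['buy 1,000 nails,2 hammers', 'glue']
print(split_items(ENCLOSED_BY_DIGITS, sample))
# ['buy 1,000 nails', '2 hammers', 'glue']
```

Both variants keep "1,000" intact. Only the second one still splits "nails,2", because that comma is not preceded by a digit.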