Given a list of glossaries:
    glossaries = ['USA', '34']
The goal is to split a string using the items in glossaries as delimiters, keeping the delimiters themselves in the output. E.g. given the string and the glossaries, an _isolate_glossaries() function:
    glossaries = ['USA', '34']
    word = '1934USABUSA'
    _isolate_glossaries(word, glossaries)
should output:
    ['19', '34', 'USA', 'B', 'USA']
I've tried:
    import re

    def isolate_glossary(word, glossary):
        print(word, glossary)
        # Return the word unchanged if it equals the glossary or does not contain it.
        if re.match('^{}$'.format(glossary), word) or not re.search(glossary, word):
            return [word]
        else:
            segments = re.split(r'({})'.format(glossary), word)
            # Drop the empty strings that re.split emits next to a match.
            return [segment for segment in segments if segment]

    def _isolate_glossaries(word, glossaries):
        word_segments = [word]
        for gloss in glossaries:
            word_segments = [out_segment
                             for segment in word_segments
                             for out_segment in isolate_glossary(segment, gloss)]
        return word_segments
It works, but so many levels of loops and regex splits look too convoluted. Is there a better way to split the string based on the glossaries?
To split the string by the items in the list, build a regex on the fly that joins those items with a pipe | and encloses them all in a capturing group. With re.split, a capturing group keeps the matched items in the output, whereas a non-capturing group drops them:
    import re
    segments = re.split('({})'.format('|'.join(glossaries)), word)
    print([x for x in segments if x])  # drop the empty strings re.split leaves around matches
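To see why the capturing group matters, here is a small comparison on the string from the question. The names with_delims and without_delims are just illustrative, and the pattern is written out by hand rather than built from the list:

```python
import re

word = '1934USABUSA'

# Capturing group: the matched delimiters are returned alongside the pieces.
with_delims = re.split(r'(USA|34)', word)

# Non-capturing group: the delimiters are consumed and dropped.
without_delims = re.split(r'(?:USA|34)', word)

print(with_delims)     # ['19', '34', '', 'USA', 'B', 'USA', '']
print(without_delims)  # ['19', '', 'B', '']
```

The empty strings appear wherever two matches are adjacent or a match sits at the end of the string, which is why the result is filtered afterwards.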
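If the glossary items may contain regex metacharacters, or one item may be a prefix of another, a slightly more defensive variant could look like the sketch below. The function name isolate_glossaries, the re.escape call and the longest-first ordering are additions for illustration and are not part of the one-liner above:

```python
import re

def isolate_glossaries(word, glossaries):
    """Split word on any glossary item, keeping the items in the output."""
    if not glossaries:
        # An empty alternation would match everywhere, so return early.
        return [word]
    # Longest first, so that e.g. 'USAF' wins over 'USA' when both are listed.
    ordered = sorted(glossaries, key=len, reverse=True)
    # re.escape makes items such as 'C++' match literally.
    pattern = '({})'.format('|'.join(re.escape(g) for g in ordered))
    # Filter out the empty strings re.split leaves around matches.
    return [segment for segment in re.split(pattern, word) if segment]

print(isolate_glossaries('1934USABUSA', ['USA', '34']))
# ['19', '34', 'USA', 'B', 'USA']
print(isolate_glossaries('C++USA', ['C++', 'USA']))
# ['C++', 'USA']
```

Without the escaping, an item like 'C++' would raise a regex error, since the second '+' has nothing to repeat.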