Tags: regex, split, nlp, tokenize

How to split strings based on a list of glossaries?


Given a list of glossaries:

glossaries = ['USA', '34']

The goal is to use the items inside the glossaries to split a string, with the glossary items acting as delimiters that are kept in the output. E.g. given the string and the glossaries below, an _isolate_glossaries() function:

glossaries = ['USA', '34']
word = '1934USABUSA'
_isolate_glossaries(word, glossaries)

should output:

['19', '34', 'USA', 'B', 'USA']

I've tried:

import re

def isolate_glossary(word, glossary):
    # Nothing to split if the word is exactly the glossary or the
    # glossary does not occur in the word at all.
    if re.match('^{}$'.format(glossary), word) or not re.search(glossary, word):
        return [word]
    else:
        # The capturing group makes re.split keep the glossary in the
        # output; drop the empty strings it leaves at the boundaries.
        segments = re.split(r'({})'.format(glossary), word)
        return [segment for segment in segments if segment]

def _isolate_glossaries(word, glossaries):
    word_segments = [word]
    for gloss in glossaries:
        word_segments = [out_segment
                         for segment in word_segments 
                         for out_segment in isolate_glossary(segment, gloss)] 
    return word_segments

It works, but it seems overly convoluted, with nested loops and a regex split per glossary item. Is there a better way to split the string based on the glossaries?


Solution

  • To split the string by the items in the list, build a regex on the fly: join the items with the pipe | alternation operator and enclose the whole alternation in a capturing group. With re.split, a capturing group keeps the matched delimiters in the output (a non-capturing group would drop them):

    segments = re.split('({})'.format('|'.join(glossaries)), word)
    print([x for x in segments if x])  # filter out the empty strings re.split leaves in
    

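For a reusable version, the one-liner can be wrapped in a small function. A minimal sketch (the isolate_glossaries name, the re.escape call, and the longest-first sort are additions, not part of the answer above): re.escape guards against glossary items that contain regex metacharacters, and sorting longer items first makes overlapping entries (e.g. 'USA' vs 'US') match greedily.

import re

def isolate_glossaries(word, glossaries):
    # Escape each item so regex metacharacters are treated literally,
    # and try longer items first so overlapping entries match greedily.
    alternation = '|'.join(re.escape(g)
                           for g in sorted(glossaries, key=len, reverse=True))
    # The capturing group makes re.split keep the delimiters in the
    # output; filter out the empty strings left at the boundaries.
    return [segment for segment in re.split('({})'.format(alternation), word)
            if segment]

print(isolate_glossaries('1934USABUSA', ['USA', '34']))
# ['19', '34', 'USA', 'B', 'USA']

Because the whole alternation is compiled into a single pattern, the string is scanned once, rather than once per glossary item as in the original nested-loop approach.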