python regex regex-alternation

Regex in Python: Separate words from numbers JUST when not in list (Variable exception)


This question is related to this one. I'd like the exceptions to be variable: they may be a list of alphanumeric words, or the list may be empty.

For instance, I have a dummy function that returns the alphanumeric values whose letters and digits have to stay together:

def get_substitutions(word):
    if word.lower() == 'h20':
        return 'h20'
    return None

In addition, I have the following main code, which collects the alphanumeric words that must not be separated. If the input (the text variable) contains an alphanumeric word that is in the exceptions, that word is left intact; otherwise a space is added between its letters and digits:

import re

text='1ST STREET SCHOOL'

exceptions = list()

for word in re.sub(r'[^\w]+', ' ', text, 0, re.IGNORECASE).split():
    if get_substitutions(word):
        exceptions.extend([word.lower()])

exception_rx = '|'.join(map(re.escape, exceptions))
generic_rx = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'
rx = re.compile(rf'({exception_rx})|{generic_rx}', re.I)

print(rx.sub(lambda x: x.group(1) or " ", text))

However, when exception_rx is empty, I get a space between every character:

1 S T   S T R E E T   S C H O O L 

Is it possible to handle this scenario using regex syntax alone, without adding an if statement?

Thanks for your help


Solution

  • It is impossible to make a regex like ()|abc match abc, because () matches the empty string at every location in the string (that is why you get a space before each char). As in any other NFA regex engine, the first alternative in a group with | that matches wins: the engine stops there and skips all the remaining alternatives to its right. See Remember That The Regex Engine Is Eager.

    In this situation, you may work around the problem by initializing the exceptions list with a word that you will never find in any text.

    For example,

    exceptions = ['n0tXistIнgŁąrd']
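Putting the workaround together with the code from the question, a minimal sketch might look like the following. The wrapper function `separate` is a hypothetical name introduced here for illustration, and the sentinel is only safe under the assumption that it never appears in real input.

```python
import re


def get_substitutions(word):
    # Dummy lookup from the question: 'h20' must stay in one piece.
    if word.lower() == 'h20':
        return 'h20'
    return None


def separate(text):
    # Hypothetical wrapper around the question's code.
    # Seed the list with a sentinel assumed never to occur in any input,
    # so the exception alternative is never the empty pattern.
    exceptions = ['n0tXistIнgŁąrd']
    for word in re.sub(r'[^\w]+', ' ', text).split():
        if get_substitutions(word):
            exceptions.append(word.lower())

    exception_rx = '|'.join(map(re.escape, exceptions))
    # Zero-width boundary between a digit and a non-digit, non-space char.
    generic_rx = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'
    rx = re.compile(rf'({exception_rx})|{generic_rx}', re.I)
    # Keep an exception word as-is; turn a boundary match into a space.
    return rx.sub(lambda m: m.group(1) or ' ', text)


print(separate('1ST STREET SCHOOL'))  # 1 ST STREET SCHOOL
print(separate('H20 water 5L'))       # H20 water 5 L
```

Because the capturing group now always contains at least one non-empty alternative, it can no longer match the empty string at every position, and the generic boundary pattern gets a chance to run.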