python regex regex-lookarounds

Regex: Join single characters as long as they are letters


Considering the following examples:

Original                       Regex
A-B-C SCHOOL INSTITUTION   --> ABC SCHOOL INSTITUTION
A B C SCHOOL INSTITUTION   --> ABC SCHOOL INSTITUTION

The goal is to join single letters that are separated by hyphens or spaces. I used the following pattern:

(?<!\w\w)(?:\s+|-)(?!\w\w)

However, the same rule must not apply to digits, and because \w also matches digits, the pattern joins them as well. For instance, the following should stay separated exactly as it is:

Original                   Regex                    Desired
A 5 M SCHOOL CORPORATION   A5M SCHOOL CORPORATION   A 5 M SCHOOL CORPORATION
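
For reference, a minimal reproduction of the behaviour described above. The pattern is the one from the question; the `samples` list and the `ORIGINAL_PATTERN` name are only there for illustration.

```python
import re

# Pattern from the question: a run of whitespace or a single hyphen that is
# neither preceded nor followed by two word characters (\w covers digits too).
ORIGINAL_PATTERN = re.compile(r'(?<!\w\w)(?:\s+|-)(?!\w\w)')

samples = [
    "A-B-C SCHOOL INSTITUTION",
    "A B C SCHOOL INSTITUTION",
    "A 5 M SCHOOL CORPORATION",
]

# Remove every matched separator from each sample.
results = [ORIGINAL_PATTERN.sub('', s) for s in samples]

for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

The first two samples come out as wanted, while the third is collapsed to `A5M SCHOOL CORPORATION`, which is the unwanted case.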

Thanks


Solution

  • First of all, (?:\s+|-) could be shortened to [\s-]+ or [ -]+. Second, you need a whitelist, not a blacklist.

    That is, instead of the negative lookahead (?!\w\w), use a positive one: (?=\w\b), or more specifically (?=[a-zA-Z]\b) in this case, so that the separator must be followed by a single letter.

    Finally, you don't want a separator that follows a digit to be matched, so rule out a preceding digit with a negative lookbehind before matching any [ -]: (?<!\d)[ -]+.

    Putting it all together:

    import re
    result = re.sub(r'(?<!\d)[ -]+(?=[a-zA-Z]\b)', '', text)  # text is the input string
    

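
As a quick check, here is a small sketch that wraps the pattern from this answer in a function and runs it on the inputs from the question. The function name `join_single_letters` and the constant name are made up for the example; only the regex itself comes from the answer.

```python
import re

# (?<!\d)          the separator must not come right after a digit
# [ -]+            one or more spaces or hyphens
# (?=[a-zA-Z]\b)   followed by a letter that is a whole word on its own
SEPARATOR_BEFORE_SINGLE_LETTER = re.compile(r'(?<!\d)[ -]+(?=[a-zA-Z]\b)')


def join_single_letters(text):
    """Drop spaces/hyphens that sit in front of a single-letter word."""
    return SEPARATOR_BEFORE_SINGLE_LETTER.sub('', text)


for sample in (
    "A-B-C SCHOOL INSTITUTION",
    "A B C SCHOOL INSTITUTION",
    "A 5 M SCHOOL CORPORATION",
):
    print(f"{sample!r} -> {join_single_letters(sample)!r}")
```

The first two inputs become `ABC SCHOOL INSTITUTION` and the third is left unchanged. One thing to keep in mind: the lookbehind only rules out digits, so a lone letter that follows a longer word (for example `SCHOOL A`) would also be glued onto it. Whether that matters depends on your data.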