regex python-3.x regex-lookarounds

Regex match characters when not preceded by a string


I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:

I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith

I am using this with the re.split function in Python 3, and I want to get this:

["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]

This is currently my regex:

(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)

I decided to try to fix the No. case first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to make it reject only the exact string No before the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.

I am trying to use something like:

(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)

But it doesn't match anything after that. How can I make it exclude certain strings that I expect to contain a period, so that it does not split on them?

Here is a regexr of my situation: https://regexr.com/4sgcb


Solution

  • Doing this with a single regex will be tricky: as noted in the comments, there are many edge cases.
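
    For comparison, a single-regex attempt could look like the sketch below. It is not the approach recommended in this answer; it only shows how negative lookbehinds can exclude exact strings such as No. and Sgt. The abbreviation list and the lookahead that refuses to split before a digit or lowercase letter are assumptions chosen to fit the sample sentence, and they will misfire on other text (for example, a sentence that really ends in "No.").

```python
import re

text = ('I am from New York, N.Y. and I would like to say hello! '
        'How are you today? I am well. I owe you $6. 00 because you '
        'bought me a No. 3 burger. -Sgt. Smith')

# Split on whitespace that follows ., ? or !, except when:
#   - the text just before it ends in "No." or "Sgt." (negative lookbehinds;
#     Python's re needs each lookbehind to be fixed-width, so every
#     abbreviation gets its own (?<!...) group), or
#   - the next character is a digit or a lowercase letter (negative lookahead;
#     \s is included so \s+ cannot backtrack into a partial match).
pattern = re.compile(r'(?<=[.?!])(?<!No\.)(?<!Sgt\.)\s+(?![\s\da-z])')

sentences = pattern.split(text)
for sentence in sentences:
    print(sentence)
```

    Every new abbreviation needs another lookbehind here, which is one reason the multi-step approach below is easier to maintain.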

    I would do it in three steps:

    1. Replace the spaces that must not be split on with a placeholder character (re.sub)
    2. Split the text (re.split)
    3. Replace the placeholder character with a space again

    For example:

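    import re
    from pprint import pprint

    # Placeholder that \s does not match; assumed not to occur in the text
    zero_width_space = '\u200B'

    s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'

    # Step 1: protect spaces after a period followed by a digit or lowercase
    # letter, after a comma, and after "Sgt."
    s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
    # Step 2: split on whitespace that follows ., ? or !
    s = re.split(r'(?<=[.?!])\s+', s)

    # Step 3: restore the protected spaces
    pprint([line.replace(zero_width_space, ' ') for line in s])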
    

    Prints:

    ['I am from New York, N.Y. and I would like to say hello!',
     'How are you today?',
     'I am well.',
     'I owe you $6. 00 because you bought me a No. 3 burger.',
     '-Sgt. Smith']
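
    To handle further "problem" strings, the three steps above could be wrapped in a helper that builds the protecting pattern from a list of abbreviations. This is a minimal sketch: split_sentences, PLACEHOLDER and the abbreviations parameter are hypothetical names, and the default list is only an assumption based on the sample text.

```python
import re

PLACEHOLDER = '\u200b'  # zero-width space, assumed absent from the input


def split_sentences(text, abbreviations=('Sgt.', 'No.')):
    """Split text into sentences, keeping listed abbreviations intact."""
    # Same protecting rules as in the answer: whitespace after a period that
    # is followed by a digit or lowercase letter, and whitespace after a comma.
    protect = r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+'
    # One fixed-width lookbehind per abbreviation; re.escape makes the
    # period (and any other special character) literal.
    for abbreviation in abbreviations:
        protect += r'|(?<={})\s+'.format(re.escape(abbreviation))

    protected = re.sub(protect, PLACEHOLDER, text)      # step 1
    parts = re.split(r'(?<=[.?!])\s+', protected)       # step 2
    return [part.replace(PLACEHOLDER, ' ') for part in parts]  # step 3


print(split_sentences('Dr. Smith arrived. He sat down.', abbreviations=('Dr.',)))
```

    As in the original code, a run of several protected whitespace characters comes back as a single space.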