python, regex, regex-lookarounds, regex-greedy

re module and positive lookbehind of variable width


I am new to programming and Python, so I apologize if this is an obvious question. I tried looking at similar questions on this website, but the solutions seem to be outside of my reach.

Problem: Consider the following text:

12/19 Paul 1/20

1/20 Jacob 10/2

Using the module re, extract the names from the above. In other words, your output should be:

['Paul', 'Jacob']

First, I tried using positive lookarounds. I tried:

import re

name_regex=re.compile(r'''(
(?<=\d{1,2}/\d{1,2}\s)      #looks for one or two digits followed by a forward slash followed by one or two digits, followed by a space
.*?                        #looks for anything besides the newline in a non-greedy manner (is the non-greedy part necessary? I am not sure...)
(?=\s\d{1,2}/\d{1,2})  #looks for a space followed by one or two digits followed by a forward slash followed by one or two digits
)''', re.VERBOSE)

text=str("12/19 Paul 1/20\n1/20 Jacob 10/2")
print(name_regex.findall(text))

However, the above yields the error:

re.error: look-behind requires fixed-width pattern

From reading similar questions, I believe that this means that look arounds cannot have variable length (i.e., they cannot look for "1 or 2 digits").

However, how can I fix this?

Any help would be greatly appreciated, especially help suited to a near-complete beginner like me!

PS. Ultimately, the list of names surrounded by dates can be very long. The dates can have one or two digits that are separated by a slash. I just wanted to give a minimal working example.

Thank you!


Solution

  • If you want to match at least one non-whitespace character between the date patterns, you might use

    (?<=\d{1,2}/\d{1,2}\s)\S.*?(?=\s\d{1,2}/\d{1,2})
    

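    The part \S.*? matches one non-whitespace character followed, non-greedily, by any characters except a newline, so the match stops at the first position where the lookahead (?=\s\d{1,2}/\d{1,2}) succeeds.

    Note that the lookbehind in this pattern is still variable-width, so Python's built-in re module rejects it with the same "look-behind requires fixed-width pattern" error shown in the question. It needs an engine that supports variable-width lookbehind, such as the third-party regex module.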


    Note that if you used .*? instead of \S.*?, the result could also contain an empty entry: ['Paul', '', 'Jacob'].


    You could also use a capturing group instead of lookarounds, which works with the built-in re module because no lookbehind is involved:

    \d{1,2}/\d{1,2}\s(\S.*?)\s\d{1,2}/\d{1,2}
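
As a minimal sketch of the capturing-group approach, the snippet below applies that pattern with the standard re module to the sample text from the question. The name name_regex is borrowed from the question; names is only an illustrative variable.

```python
import re

# Capturing-group variant: the dates are consumed by the match itself,
# so no lookbehind (and no fixed-width restriction) is involved.
name_regex = re.compile(r"\d{1,2}/\d{1,2}\s(\S.*?)\s\d{1,2}/\d{1,2}")

# Sample text from the question.
text = "12/19 Paul 1/20\n1/20 Jacob 10/2"

# With exactly one capturing group, findall returns that group's contents.
names = name_regex.findall(text)
print(names)  # ['Paul', 'Jacob']
```

One consequence of this design is that each date is consumed by the match, so a single date cannot close one name and open the next. That is not a problem for the question's data, where every name has its own pair of dates.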
    
