Free text parsing using long regex formula leading to error: multiple repeat in python?

I need to parse specific strings from a free text field in an .xlsx file. I am using Python 2.7 in Spyder.

I escaped the '.' in the regex formulas but I am still getting the same error.

To do that, I used pandas to convert the .xslx file into a pandas dataframe:

data = "complaints_data.xlsx"
read_data = pd.read_excel(data)
read_data.dropna(inplace = False)
df = pd.DataFrame(read_data)
df['FMEA Assessment'] = df['FMEA Assessment'].replace({',':''}, regex=True)

Then, I used the extract function of pandas to extract my string fields FMEA, Rev and Line using regex patterns.

fmea_pattern = r'(FMEA\s*\d*\d*\d*\d*\d*|fmea\s*\d*\d*\d*\d*\d*|DOC\s*\-*[0]\d*\d*\d*\d*\d*|doc\s*\-*[0]\d*\d*\d*\d*\d*)'
df[['FMEA']] = df['FMEA Assessment'].str.extract(fmea_pattern, expand=True)
    
rev_pattern = r'(Rev\.*\s+\D{1,2}+|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
df[['REV']] = df['FMEA Assessment'].str.extract(rev_pattern, expand=True)
    


line_pattern = r'(line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINES\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINE\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.)'
df[['LINE']] = df['FMEA Assessment'].str.extract(line_pattern, expand=True)

The string fields I need to parse can be inputted in various ways and I accounted for each way in the regex formulas and for each variation of a word; for example, I accounted for line, Line, LINE, lines, Lines, etc. I have tested the regex formulas individually and separately and they are working properly. However, when I combine all of them in the code above, I get the following error message:

Error Message Screenshot

Also, is there another way to account for variations of the same word at the same time(lower case, upper case and title case)?

Solution

Python 3.11 and newer

re library now supports possessive qunatifiers similarly to PCRE, and this issue no longer exists.

*+, ++, ?+
Like the *, +, and ? quantifiers, those where + is appended also match as many times as possible. However, unlike the true greedy quantifiers, these do not allow back-tracking when the expression following it fails to match. These are known as possessive quantifiers. For example, a*a will match aaaa because the a* will match all 4 as, but, when the final a is encountered, the expression is backtracked so that in the end the a* ends up matching 3 as total, and the fourth a is matched by the final a. However, when a*+a is used to match aaaa, the a*+ will match all 4 a, but when the final a fails to find any more characters to match, the expression cannot be backtracked and will thus fail to match. x*+, x++ and x?+ are equivalent to (?>x*), (?>x+) and (?>x?) correspondingly.

New in version 3.11.

and

{m,n}+
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible without establishing any backtracking points. This is the possessive version of the quantifier above. For example, on the 6-character string 'aaaaaa', a{3,5}+aa attempt to match 5 'a' characters, then, requiring 2 more 'a's, will need more characters than available and thus fail, while a{3,5}aa will match with a{3,5} capturing 5, then 4 'a's by backtracking and then the final 2 'a's are matched by the final aa in the pattern. x{m,n}+ is equivalent to (?>x{m,n}).

New in version 3.11.

Python versions below 3.11

The main error in this case is due to the fact you are using a possessive quantifier instead of a regular, non-possessive quantifier.

It is a common mistake when users test their patterns in the online PCRE regex testers. You need to make sure you ALWAYS test your regexps in the environment (or with a regex engine option) that is compatible with your target environment.

Python re does not support possessive quantifiers:

{5}+
{5,}+
{5,10}+
++
?+
*+

In this case, you just need to remove the trailing + from \D{1,2}+:

rev_pattern = r'(Rev\.*\s+\D{1,2}|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'

It seems you may just use

rev_pattern = r'((?:[Rr]ev|REV)\.*\s+\D{1,2})' # Will only match Rev, REV and rev at the start
rev_pattern = r'(?i)(Rev\.*\s+\D{1,2})' # Will match any case variations of Rev

See the regex demo at Regex101, note the Python option selected on the left.

Also, note that it is possible to make the whole pattern case insensitive by adding (?i) at the start of the pattern, or by compiling the regex with re.I or re.IGNORECASE arguments. This will "account for variations of the same word at the same time(lower case, upper case and title case)".

NOTE: if you actually are looking to use a possessive quantifier you may emulate a possessive quantifier with the help of a positive lookahead and a backreference. However, in Python, you would need re.finditer to get access to the whole match values.