I need to parse specific strings from a free text field in an .xlsx file. I am using Python 2.7 in Spyder.
I escaped the '.' in the regex formulas but I am still getting the same error.
To do that, I used pandas to convert the .xslx file into a pandas dataframe:
data = "complaints_data.xlsx"
read_data = pd.read_excel(data)
read_data.dropna(inplace = False)
df = pd.DataFrame(read_data)
df['FMEA Assessment'] = df['FMEA Assessment'].replace({',':''}, regex=True)
Then, I used the extract function of pandas to extract my string fields FMEA, Rev and Line using regex patterns.
fmea_pattern = r'(FMEA\s*\d*\d*\d*\d*\d*|fmea\s*\d*\d*\d*\d*\d*|DOC\s*\-*[0]\d*\d*\d*\d*\d*|doc\s*\-*[0]\d*\d*\d*\d*\d*)'
df[['FMEA']] = df['FMEA Assessment'].str.extract(fmea_pattern, expand=True)
rev_pattern = r'(Rev\.*\s+\D{1,2}+|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
df[['REV']] = df['FMEA Assessment'].str.extract(rev_pattern, expand=True)
line_pattern = r'(line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINES\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINE\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.)'
df[['LINE']] = df['FMEA Assessment'].str.extract(line_pattern, expand=True)
The string fields I need to parse can be inputted in various ways and I accounted for each way in the regex formulas and for each variation of a word; for example, I accounted for line, Line, LINE, lines, Lines, etc. I have tested the regex formulas individually and separately and they are working properly. However, when I combine all of them in the code above, I get the following error message:
Also, is there another way to account for variations of the same word at the same time(lower case, upper case and title case)?
Python 3.11 and newer
re
library now supports possessive qunatifiers similarly to PCRE, and this issue no longer exists.
*+, ++, ?+
Like the*
,+
, and?
quantifiers, those where+
is appended also match as many times as possible. However, unlike the true greedy quantifiers, these do not allow back-tracking when the expression following it fails to match. These are known as possessive quantifiers. For example,a*a
will matchaaaa
because thea*
will match all 4a
s, but, when the finala
is encountered, the expression is backtracked so that in the end thea*
ends up matching 3a
s total, and the fourtha
is matched by the finala
. However, whena*+a
is used to matchaaaa
, thea*+
will match all 4a
, but when the finala
fails to find any more characters to match, the expression cannot be backtracked and will thus fail to match.x*+
,x++
andx?+
are equivalent to(?>x*)
,(?>x+)
and(?>x?)
correspondingly.New in version 3.11.
and
{m,n}+
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible without establishing any backtracking points. This is the possessive version of the quantifier above. For example, on the 6-character string'aaaaaa'
,a{3,5}+aa
attempt to match 5'a'
characters, then, requiring 2 more'a'
s, will need more characters than available and thus fail, whilea{3,5}aa
will match witha{3,5}
capturing 5, then 4'a'
s by backtracking and then the final 2'a'
s are matched by the finalaa
in the pattern.x{m,n}+
is equivalent to(?>x{m,n})
.New in version 3.11.
Python versions below 3.11
The main error in this case is due to the fact you are using a possessive quantifier instead of a regular, non-possessive quantifier.
It is a common mistake when users test their patterns in the online PCRE regex testers. You need to make sure you ALWAYS test your regexps in the environment (or with a regex engine option) that is compatible with your target environment.
Python re
does not support possessive quantifiers:
{5}+
{5,}+
{5,10}+
++
?+
*+
In this case, you just need to remove the trailing +
from \D{1,2}+
:
rev_pattern = r'(Rev\.*\s+\D{1,2}|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
It seems you may just use
rev_pattern = r'((?:[Rr]ev|REV)\.*\s+\D{1,2})' # Will only match Rev, REV and rev at the start
rev_pattern = r'(?i)(Rev\.*\s+\D{1,2})' # Will match any case variations of Rev
See the regex demo at Regex101, note the Python
option selected on the left.
Also, note that it is possible to make the whole pattern case insensitive by adding (?i)
at the start of the pattern, or by compiling the regex with re.I
or re.IGNORECASE
arguments. This will "account for variations of the same word at the same time(lower case, upper case and title case)".
NOTE: if you actually are looking to use a possessive quantifier you may emulate a possessive quantifier with the help of a positive lookahead and a backreference. However, in Python, you would need re.finditer
to get access to the whole match values.