I am trying to parse a large sample of text files with regular expressions (RE). I am trying to extract from these files the part of the text which contains 'vu' and ends with a newline '\n'.
Patterns differ from one file to another, so I tried to look for combinations of RE in my files using the OR operator. However, I did not find a way to automate my code so that the re.findall() function looks for a combination of RE.
Here is an example of how I tried to tackle this issue. The joined pattern never matches, because the regular expressions and the OR operator between them are not evaluated the way I expected in re.findall():
import re

def series2string(myserie) :
    myserie2 = ' or '.join(serie for serie in myserie)
    return myserie2

def expression(pattern, mystring) :
    x = re.findall(pattern, mystring)
    if len(x)>0:
        return 1
    else:
        return 0

#text example
text = "\n\n (troisième chambre)\n i - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

#expressions to look out
pattern1 = '^\s*vu.*\n'
pattern2 = '^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'
pattern = [pattern1, pattern2]
pattern = series2string(pattern)
expression(pattern, text)
Note: I worked around this problem by searching for each pattern in a for loop, but my code would run faster if I could call re.findall() just once.
Python regular expressions use the | operator for alternation.
def series2string(myserie) :
    myserie2 = '|'.join(serie for serie in myserie)
    myserie2 = '(' + myserie2 + ')'
    return myserie2
More information: https://docs.python.org/3/library/re.html
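As a quick check of the difference, here is a minimal sketch that joins the two patterns from the question with | and compares the result with the ' or ' version. The helper name combine_patterns, the shortened sample text, and the use of a non-capturing group (?:...) instead of plain parentheses are my own choices for illustration, not something taken from the question.

```python
import re

def combine_patterns(patterns):
    # Join the alternatives with '|' inside a non-capturing group, so that
    # re.findall() returns the whole matched text rather than group contents.
    return '(?:' + '|'.join(patterns) + ')'

# Shortened version of the sample text from the question (illustrative only).
text = "\n\n (troisième chambre)\n i - vu la requête, enregistrée le 28 février 1997 ;\n"

# The two patterns from the question, written as raw strings.
patterns = [
    r'^\s*vu.*\n',
    r'^\s*\(\w*\s*\w*\)\s*.*?vu.*\n',
]

combined = re.compile(combine_patterns(patterns))
matches = combined.findall(text)
has_match = int(bool(matches))

# Joining with ' or ' builds a pattern that looks for the literal text " or ".
literal_or_matches = re.findall(' or '.join(patterns), text)

print(has_match, literal_or_matches)
```

Compiling the combined pattern once and reusing it for every file should also avoid rebuilding the pattern string in a loop.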
The individual patterns look really messy, so I don't know what is a mistake, and what is intentional. I am guessing you are looking for the word "vu" in a few different contexts.
- Use raw strings: prefix the pattern with r (r'pattern here'). It allows you to use \ in a pattern without Python trying to interpret it as a string escape. It is passed directly to the regex engine.
- Use \s to match white-space (spaces and line-breaks).
- Don't make ( and ) optional. It can result in catastrophic backtracking, which can make matching large strings really slow.
  \(? → \(
  \)? → \)
- {1} doesn't do anything. It just repeats the previous sub-pattern once, which is the same as not specifying anything.
- \br is invalid in a regular (non-raw) string. It is interpreted as \b (the ASCII backspace character) + the letter r.
- There is a quote (') at the beginning of your text-string. Either you intend ^ to match the start of any line, or the ' is a copy/paste error.

Some errors when combining the patterns:
pattern = [pattern1, pattern2, pattern3, pattern4]
pattern = series2string(pattern)
expression(re.compile(pattern), text)
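
The raw-string and ^ tips above can be checked directly. This is only a sketch: the sample strings are made up for illustration, and using re.MULTILINE assumes you do want ^ to match at the start of every line, which the question does not state.

```python
import re

# In a regular string, Python converts \b to a backspace character (0x08)
# before the regex engine ever sees it. In a raw string, the backslash is
# kept, so the regex engine reads \b as a word boundary.
non_raw = '\bvu'
raw = r'\bvu'

line = "i - vu la requête\n"
non_raw_match = re.search(non_raw, line)   # looks for a literal backspace
raw_match = re.search(raw, line)           # looks for "vu" at a word boundary

# Without re.MULTILINE, ^ only matches at the very start of the string.
# With it, ^ also matches right after each newline.
sample = "\n\n (troisième chambre)\n vu la requête ;\n"
start_only = re.findall(r'^\s*vu.*\n', sample)
any_line = re.findall(r'^\s*vu.*\n', sample, flags=re.MULTILINE)

print(repr(non_raw[0]), non_raw_match, raw_match.group(), start_only, any_line)
```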