Tags: python, regex, or-operator

Python regular expression using the OR operator


I am trying to parse a large sample of text files with regular expressions (RE). I am trying to extract from these files the part of the text which contains 'vu' and ends with a newline '\n'.

Patterns differ from one file to another, so I tried to look for combinations of RE in my files using the OR operator. However, I did not find a way to automate my code so that the re.findall() function looks for a combination of RE.

Here is an example of how I tried to tackle this issue, but apparently I still cannot get re.findall() to evaluate both of my regular expressions combined with the OR operator:

import re

def series2string(myserie) :
    myserie2 = ' or '.join(serie for serie in myserie)
    return myserie2

def expression(pattern, mystring) : 
    x = re.findall(pattern, mystring)
    if len(x)>0:
        return 1
    else:
        return 0

#text example
text = "\n\n    (troisième chambre)\n    i - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

#expressions to look out
pattern1 = '^\s*vu.*\n'
pattern2 = '^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'

pattern = [pattern1, pattern2]
pattern = series2string(pattern)

expression(pattern, text)

Note : I circumvented this problem by looking for each pattern in a for loop but my code would run faster if I could just use re.findall() once.


Solution

  • Python regular expressions use the | operator for alternation.

    def series2string(myserie) :
        myserie2 = '|'.join(serie for serie in myserie)
        myserie2 = '(' + myserie2 + ')'
        return myserie2
    

    More information: https://docs.python.org/3/library/re.html
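As a minimal sketch of how the joined pattern behaves with re.findall(), the example below combines two alternatives and runs a single search. The sample text is shortened from the question, and the helper name combine_patterns, the second pattern, and the use of re.MULTILINE (so that ^ matches at the start of every line) are assumptions made for illustration, not part of the original code. Each alternative is wrapped in a non-capturing group (?:...) so that re.findall() returns whole matches rather than group contents.

```python
import re

def combine_patterns(patterns):
    # Hypothetical helper: join the alternatives with "|" and wrap each one
    # in a non-capturing group so findall() returns the whole match.
    return "|".join(f"(?:{p})" for p in patterns)

# Shortened sample text (assumption: one candidate per line).
text = "(troisième chambre)\ni - vu la requête ;\nvu le mémoire ;\n"

patterns = [
    r"^vu.*\n",      # a line that starts with "vu"
    r"^i - vu.*\n",  # a line that starts with "i - vu" (illustrative only)
]

# re.MULTILINE makes ^ match at the start of each line, not only at the
# start of the whole string.
combined = re.compile(combine_patterns(patterns), re.MULTILINE)
matches = combined.findall(text)
print(matches)
```

With this input, a single call finds both the "i - vu ..." line and the "vu ..." line, so no loop over the individual patterns is needed.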


    The individual patterns look really messy, so I don't know what is a mistake, and what is intentional. I am guessing you are looking for the word "vu" in a few different contexts.

    1. Always use Python raw strings for regular expressions, prefixed with r (r'pattern here'). This lets you use \ in a pattern without Python trying to interpret it as a string escape; the backslash is passed directly to the regex engine.
    2. Use \s to match whitespace (spaces, tabs, and line breaks).
    3. Since you already have several alternative patterns, don't make ( and ) optional. It can result in catastrophic backtracking, which can make matching large strings really slow.
      \(?  →  \(
      \)?  →  \)
    4. {1} doesn't do anything. It just repeats the previous sub-pattern once, which is the same as not specifying anything.
    5. \br is invalid. In a non-raw string, it is interpreted as \b (the ASCII backspace character) + the letter r.
    6. You have a quote character (') at the beginning of your text-string. Either you intend ^ to match the start of any line, or the ' is a copy/paste error.
    7. Some errors when combining the patterns:

      pattern = [pattern1, pattern2, pattern3, pattern4]
      pattern = series2string(pattern)
      
      expression(re.compile(pattern), text)
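Points 1, 4 and 5 above can be checked with a short sketch. The sample string and variable names here are made up for illustration. It compares a pattern written as a normal string literal with the same pattern written as a raw string, and shows that {1} does not change what is matched.

```python
import re

# In a normal string literal, "\b" is the backspace character (0x08).
# In a raw string it stays as backslash + "b", which the regex engine
# reads as a word boundary.
plain = "\bvu\b"
raw = r"\bvu\b"

print(len(plain))  # 4 characters: backspace, v, u, backspace
print(len(raw))    # 6 characters: \, b, v, u, \, b

sample = "i - vu la requête"  # hypothetical sample line

plain_hits = re.findall(plain, sample)  # looks for literal backspaces
raw_hits = re.findall(raw, sample)      # looks for the whole word "vu"
print(plain_hits, raw_hits)

# {1} repeats the previous item exactly once, so it is a no-op.
same = re.findall(r"vu{1}", sample) == re.findall(r"vu", sample)
print(same)
```

The non-raw version finds nothing because the sample contains no backspace characters, while the raw version finds the word "vu".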