Tags: python, regex, data-extraction

Best way to extract data using re.compile


I need to extract (a lot of) info from different text files. I wonder if there is a shorter and more efficient way than the following:

First part: (N lines long)

N1 = re.compile(r'')
N2 = re.compile(r'')
.
Nn = re.compile(r'')

Second part: (2N lines long)

with open(filename) as f:
  for line in f:
    if N1.match(line):
      var1 = N1.match(line).group(x).strip()
    elif N2.match(line):
      var2 = N2.match(line).group(x).strip()
    elif Nn.match(line):
      varn = Nn.match(line).group(x).strip()

Do you recommend keeping the re.compile variables (part 1) separate from part 2? What do people use in these cases? Perhaps a function that takes the regex as an argument and is called every time?

In my case N is 30, meaning I have 90 lines for feeding a dictionary with very little, or no logic at all.


Solution

  • I’m going to attempt to answer this without really knowing what you are actually doing there. So this answer might help you, or it might not.

    First of all, what re.compile does is pre-compile a regular expression, so you can use it later and do not have to compile it every time you use it. This is primarily useful when you have a regular expression that is used multiple times throughout your program. But if the expression is only used a few times, then there is not really that much of a benefit to compiling it up front.

    So ask yourself how often the code that attempts to match all those expressions actually runs. If it runs just once during the script's execution, you can simplify your code by inlining the expressions. Since you’re running the matches for each line in a file, pre-compiling likely makes sense here.
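    To make the trade-off concrete, here is a minimal sketch of both styles. The sample lines and the `name:` pattern are made up for illustration and do not come from the question. Both functions return the same result; the re module also keeps an internal cache of recently compiled patterns, so the inline form does not recompile on every call, but the pre-compiled form skips the cache lookup and keeps the pattern in one named place.

```python
import re

# Hypothetical sample data and pattern, for illustration only.
lines = ["name: Alice", "age: 30", "name: Bob"]

# Compiled once at import time and reused for every line.
NAME_RE = re.compile(r"name:\s*(\w+)")


def names_precompiled(lines):
    """Match each line against the pre-compiled pattern object."""
    found = []
    for line in lines:
        m = NAME_RE.match(line)
        if m:
            found.append(m.group(1))
    return found


def names_inline(lines):
    """Same logic, but the pattern string is passed to re.match on each call."""
    found = []
    for line in lines:
        m = re.match(r"name:\s*(\w+)", line)
        if m:
            found.append(m.group(1))
    return found


print(names_precompiled(lines))  # ['Alice', 'Bob']
print(names_inline(lines))       # ['Alice', 'Bob']
```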

    But just because you pre-compiled the expression, that does not mean that you should be sloppy and match the same expression too often. Look at this code:

    if N1.match(line):
        var1 = N1.match(line).group(x).strip()
    

    Assuming there is a match, this will run N1.match() twice. That’s an overhead you should avoid since matching expressions can be relatively expensive (depending on the expression), even if the expression is already pre-compiled.

    Instead, just match it once, and then reuse the result:

    n1_match = N1.match(line)
    if n1_match:
        var1 = n1_match.group(x).strip()
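
    If you are on Python 3.8 or later, an assignment expression (the `:=` operator) lets you do the same thing in one step. This is only a sketch: the `temperature:` pattern and the `extract` helper are hypothetical stand-ins for your N1 and your own logic.

```python
import re

# Hypothetical pattern standing in for N1 from the question.
N1 = re.compile(r"temperature:\s*(.+)")


def extract(line):
    """Return the stripped captured value, or None if the line does not match."""
    # The match runs exactly once; its result is bound to m and tested in one step.
    if (m := N1.match(line)):
        return m.group(1).strip()
    return None
```

    For example, `extract("temperature:  21.5 C \n")` gives `"21.5 C"`, while a line that does not start with `temperature:` gives `None`.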
    

    Looking at your code, your regular expressions also appear to be mutually exclusive, or at least you only ever use the first match and skip the remaining ones. In that case, you should order your checks so that the most common ones are done first. That way, you avoid running too many expressions that won’t match anyway. Also, try to order them so that more complex expressions are run less often.
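    One way to keep that ordering explicit, and to get rid of the long if/elif chain at the same time, is to put the patterns in a list and loop over it. The following is a sketch under assumptions: the field names, the `key=value` patterns, and the `extract_fields` helper are invented for illustration, and I am assuming you want the results in a dictionary, as you mention.

```python
import re

# Hypothetical (key, pattern) pairs. Order matters: put the patterns that
# match most often (or are cheapest to test) first.
PATTERNS = [
    ("host", re.compile(r"host=(\S+)")),
    ("port", re.compile(r"port=(\d+)")),
    ("user", re.compile(r"user=(\S+)")),
]


def extract_fields(lines):
    """Collect the first captured group of the first pattern matching each line."""
    result = {}
    for line in lines:
        for key, pattern in PATTERNS:
            m = pattern.match(line)  # matched once, result reused below
            if m:
                result[key] = m.group(1).strip()
                break  # first match wins; the remaining patterns are skipped
    return result
```

    With this layout, adding a field means adding one entry to the list instead of three more lines of code, and you can pass an open file object as `lines`.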

    Finally, you are collecting the match results in separate variables varN. I’m questioning what exactly you are doing there, since after all your if checks you have no clear way of figuring out which expression matched and which variable to use. It might make more sense to collect the result in a single variable, or to move the specific logic into the condition bodies. But it’s difficult to tell with the amount of information you gave.
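    If a single container fits your case, one possible alternative (not something your code does, just a technique worth naming) is to combine the expressions into one alternation with named groups and use the match object's `lastgroup` attribute to find out which one matched. This sketch assumes each alternative contains exactly one named group and that the alternatives are mutually exclusive; the patterns and the `extract_record` helper are hypothetical.

```python
import re

# Hypothetical combined pattern: one named group per alternative.
COMBINED = re.compile(
    r"host=(?P<host>\S+)"
    r"|port=(?P<port>\d+)"
    r"|user=(?P<user>\S+)"
)


def extract_record(lines):
    """Store each matched value in one dict, keyed by the group that matched."""
    record = {}
    for line in lines:
        m = COMBINED.match(line)
        if m:
            # lastgroup is the name of the group that took part in the match.
            record[m.lastgroup] = m.group(m.lastgroup)
    return record
```

    Here the dictionary key tells you which expression matched, so there is no need to track separate varN variables. Whether one big pattern is faster than several small ones depends on the expressions, so measure before committing to it.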