Tags: python, regex, performance, time, format

Why does basic string formatting outperform the built-in re.compile() function?


So I used the dis library (the disassembler for Python bytecode) to understand what causes this interesting performance difference, which shows up at different scales, even with thousands of strings rather than the single-string example below. However, I am still clueless.
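
For reference, dis can be invoked roughly like this to compare the two variants (the wrapper functions below are only illustrative):

import dis
import re

def with_compiled(s, compiled=re.compile(r'F.{4}D.AX')):
    return re.match(compiled, s)

def with_formatted(s, pattern=r'^{}$'.format(r'F.{4}D.AX')):
    return re.match(pattern, s)

dis.dis(with_compiled)   # disassemble the caller that uses the precompiled pattern
dis.dis(with_formatted)  # disassemble the caller that uses the formatted pattern string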

I used both the standard time library and timeit for the measurements.

import re
import timeit

from Bio import SeqIO

def find_overlapped_instances(regex, string):
    found_any = False
    for start in range(len(string)):
        if found_any:  # addition: stop scanning once a match has been found
            break
        for end in range(start, len(string)):
            cur = string[start:end+1]
            if re.match(regex, cur):
                found_any = True
                break
    return found_any

#regex = re.compile(r'F.{4}D.AX')
regex = r'^{}$'.format(r'F.{4}D.AX')


fasta_file = "voodoo.fasta"

total_time = 0
num_hits = 0
for record in SeqIO.parse(fasta_file, "fasta"):
    sequence = str(record.seq)
    starttime = timeit.default_timer()
    found_1 = find_overlapped_instances(regex, sequence)
    total_time += timeit.default_timer() - starttime
    if found_1:
        num_hits += 1
print(num_hits)
print(total_time)

The formatted pattern string seems to speed up the search (which is the main, and nearly the only, source of computation), somehow giving roughly a 25% performance difference overall:

Searching the pattern in n strings    re.compile()      r'^{}$'.format()
n = 200                               64.3 seconds      47.1 seconds
n = 400                               136.7 seconds     101.7 seconds

Can someone help me understand why this might be happening?


Solution

  • Apart from the fact that the test should use the same regex (i.e. both with or without the ^ and $ anchors), the main problem is that when you have a compiled regular expression object, you should not pass it as the first argument to re.match, but call match on that object. Only then do you benefit from the compilation.

    Here is a simpler test that shows this difference, using three scenarios:

    1. Calling match on the compiled regular expression object (fastest)
    2. Calling re.match with the pattern string as argument (medium)
    3. Calling re.match with the compiled regular expression object as argument (worst)

    Here is a script for testing that:

    import re
    import timeit
    
    pattern = r'F.{4}D.AX'
    compiled = re.compile(pattern)
    s = "qsdf F9999D9AX los F999D99AX"
    
    def test1():
        compiled.match(s)      # match called on the compiled object
    
    def test2():
        re.match(pattern, s)   # pattern string passed to re.match
    
    def test3():
        re.match(compiled, s)  # compiled object passed to re.match
    
    for test in (test1, test2, test3):
        print(timeit.timeit(test, number=1000000))
    

    Here is the output I got:

    0.7618284979998862
    2.8180372250008077
    4.692441613999108
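
    Applied to the original search function, this means compiling the pattern once and then calling match on the compiled object inside the loops. Here is a minimal sketch of that change (the test string is made up; the anchors are kept so that each slice has to match the pattern exactly):

    import re
    
    def find_overlapped_instances(compiled, string):
        # 'compiled' is a re.Pattern object produced by re.compile
        for start in range(len(string)):
            for end in range(start, len(string)):
                if compiled.match(string[start:end + 1]):
                    return True   # early exit replaces the found_any flag
        return False
    
    regex = re.compile(r'^{}$'.format(r'F.{4}D.AX'))  # compile once, reuse for every sequence
    print(find_overlapped_instances(regex, "qsdfF9999D9AXlos"))  # True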