python optimization text-processing

Processing through files in Python


Good morning, fellow coders. I have a question about finding specific lines in a file using Python. One approach is to use if line.startswith(word) and do the work inside the if block; the other is to use if not line.startswith(word) followed by continue. What's the difference between the two? Is one better than the other for larger text-processing programs, for example in processing speed or resource usage? Which one should I make a habit of using? I've tested both on small programs and there is almost no difference in runtime.

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

VS

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From:'):
        continue
    print(line)

Thank you for your answers.

Test Run:

import time
t0 = time.time()
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)
t1 = time.time() - t0
print('Time Elapsed:', t1)

Run time speeds 1: Time Elapsed: 0.013382196426391602

Run time speeds 2: Time Elapsed: 0.0033702850341796875

Run time speeds 3: Time Elapsed: 0.0040471553802490234

import time
t0 = time.time()
fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From:'):
        continue
    print(line)
t1 = time.time() - t0
print('Time Elapsed:', t1)

Run time speeds 1: Time Elapsed: 0.0037872791290283203

Run time speeds 2: Time Elapsed: 0.003139495849609375

Run time speeds 3: Time Elapsed: 0.0030825138092041016
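
Single time.time() measurements like the ones above are noisy, and they also include the cost of opening the file and printing to the console. A rough sketch of a steadier comparison with the standard timeit module is shown here. It assumes an in-memory list of made-up lines instead of mbox-short.txt, and the function names with_if and with_continue are hypothetical, chosen only for illustration.

```python
import timeit

# Hypothetical in-memory sample standing in for the contents of mbox-short.txt.
lines = ['From: someone@example.com\n', 'Subject: test\n', 'X-Header: value\n'] * 1000


def with_if(lines):
    """Style 1: do the work inside the if block."""
    matches = []
    for line in lines:
        line = line.rstrip()
        if line.startswith('From:'):
            matches.append(line)
    return matches


def with_continue(lines):
    """Style 2: skip non-matching lines with an early continue."""
    matches = []
    for line in lines:
        line = line.rstrip()
        if not line.startswith('From:'):
            continue
        matches.append(line)
    return matches


# Both styles must select exactly the same lines.
assert with_if(lines) == with_continue(lines)

for func in (with_if, with_continue):
    # repeat() returns one total time per round; the minimum is the least noisy.
    best = min(timeit.repeat(lambda: func(lines), number=100, repeat=5))
    print(f'{func.__name__}: {best:.4f} s per 100 calls')
```

Collecting matches in a list rather than printing them keeps console I/O out of the measurement, so the two loop styles are compared on their own.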


Solution

  • There is no measurable difference between these two styles in terms of performance.

    It is a matter of coding style and readability.

    In this particular example, in my opinion the code is perfectly understandable in both forms.

    Personally, I prefer the variant with "continue".

    This coding style can be called "early return" (or, in this case, "early continue"). The idea is that you get the irrelevant cases out of the way first and then process the relevant ones. See this question on Software Engineering for a discussion of the "early return" pattern.

    If you use "continue" early to skip the irrelevant lines, an added benefit is that the code that handles the relevant lines sits one indentation level shallower (further to the left of the screen), which helps readability.
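
    The indentation benefit shows up more clearly once there is more than one condition to filter on. The following is only a minimal sketch: the function name sender_addresses, the second check, and the sample lines are assumptions made up for illustration, not part of the original question.

```python
def sender_addresses(lines):
    """Collect the address from each 'From:' line, skipping everything else."""
    addresses = []
    for line in lines:
        line = line.rstrip()
        # Guard clauses: get the irrelevant cases out of the way first.
        if not line.startswith('From:'):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # a 'From:' line with nothing after it
        # The main logic stays at a single indentation level.
        addresses.append(parts[1])
    return addresses


# Hypothetical sample input in the style of a mailbox file.
sample = [
    'From: alice@example.com\n',
    'Subject: hello\n',
    'From:\n',
    'From: bob@example.com\n',
]
print(sender_addresses(sample))
```

    Written with nested if blocks instead, the append call would sit two levels deeper, and each additional condition would push it further to the right.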