Search code examples
pythonprofilingany

Why is Python any( pred for _ in _ ) much faster than for loop


This question is very similar to: this post, but I couldn't find the answer there.

# 0m2.676s
#
# note that it should be == instead of in
# but as far as timing goes, it would be even faster
# (thanks to @Booboo for pointing this)
#
# here: ------------++
#                   ||
if any("xmuijdswly" in w for w in data):
    print("FOUND IT")

is much faster than:

# 0m13.476s
for d in data:
    if "xmuijdswly" == d:
        print("FOUND IT")
        break

my data contains 10^7 arbitrary strings of average length 30

EDIT: the time difference exists in both linux and windows:

PS ... > Measure-Command { python test_any.py }
TotalSeconds      : 4.5402383

PS ...> Measure-Command { python test_for.py }
TotalSeconds      : 17.7107506

EDIT: generating random strings program (for completeness)

$ cat main.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *fl=fopen("input.txt", "w+t");
    for (int i=0;i<10000000;i++)
    {
        int length = 5 + rand() % 50;
        for (int j=0;j<length;j++)
            fprintf(fl, "%c", rand() % 26 + 'a');
        fprintf(fl, "\n");
    }
    fclose(fl);
    return 0;
}

Solution

  • First, your two benchmarks are performing different tests. The first one, with the for loop, is matching a string against a list of strings by testing for equality. The second one, is using the in operator testing whether one string is a substring of another and would be a more expensive operation than testing for equality. But even if we change the second benchmark to instead test for equality, which should make it run even faster, I find the results to be the opposite of what you found, i.e. using any runs more slowly than the for loop. In the following demo I have arranged it that both benchmarks must iterate through all 10_000_000 strings because a match will never be found:

    import time
    
    data = ['-' * 30 for _ in range(10_000_000)]
    
    def test1():
        for d in data:
            if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
                return True
        return False
    
    def test2():
        return any('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w for w in data)
    
    t = time.time()
    b = test1()
    print(b, time.time() - t)
    
    t = time.time()
    b = test2()
    print(b, time.time() - t)
    

    Prints:

    False 0.2701404094696045
    False 0.5590765476226807
    

    The second version using any runs twice as slowly (perhaps you should re-check your results). Why does the version using any actually run more slowly?

    The argument to any is a generator expression, which results in creating a generator function. So to test each value of the data a function call to generate the next comparison result must be made and that function call is what is making things run more slowly. We can turn your any version into one that uses an explicit generator function:

    import time
    
    data = ['-' * 30 for _ in range(10_000_000)]
    
    def test1():
        for d in data:
            if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
                return True
        return False
    
    def generator():
        for w in data:
            yield 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w
    
    def test2():
        return any(generator())
    
    t = time.time()
    b = test1()
    print(b, time.time() - t)
    
    t = time.time()
    b = test2()
    print(b, time.time() - t)
    

    Prints:

    False 0.28400230407714844
    False 0.5430142879486084
    

    As you can see, the time required for both any versions is essentially the same, as I would expect.

    Update: Using the OP's C Program to Generate the Data

    Here I used the OP's C program to generate the data and I have used the same "xmuijdswly" string for the comparison:

    import time
    
    total_length = 0
    data = []
    with open('input.txt') as f:
        for line in f:
            line = line.strip()
            total_length += len(line)
            data.append(line)
    print('Average string length:', total_length / len(data))
    
    def test1():
        for d in data:
            if 'xmuijdswly' == d:
                return True
        return False
    
    def test2():
        return any('xmuijdswly' == w for w in data)
    
    t = time.time()
    b = test1()
    print(b,  'Time using for loop:', time.time() - t)
    
    t = time.time()
    b = test2()
    print(b, 'Time using any:', time.time() - t)
    

    Prints:

    Average string length: 29.4972984
    False Time using for loop: 0.3110032081604004
    False Time using any: 0.6610157489776611
    

    One Possible Explanation for OP's Results

    The OP hasn't actually posted the complete programs that they were testing against, which presumably have to first read in the data from input.txt. If the any version was run second, the data in input.txt would have been cached by the operating system as a result of the looping version having read the file and therefore the I/O time to read the input data would be much less for the any version.

    Was the time to do that initial creation of the data list erroneously included in the benchmark timings?