This question is very similar to: this post, but I couldn't find the answer there.
# 0m2.676s
#
# note that it should be == instead of in,
# but as far as timing goes, it would be even faster
# (thanks to @Booboo for pointing this out)
#
# here: ------------++
#                    ||
if any("xmuijdswly" in w for w in data):
    print("FOUND IT")
is much faster than:
# 0m13.476s
for d in data:
    if "xmuijdswly" == d:
        print("FOUND IT")
        break
My data contains 10^7 arbitrary strings of average length 30.
EDIT: the time difference exists on both Linux and Windows:
PS ... > Measure-Command { python test_any.py }
TotalSeconds : 4.5402383
PS ...> Measure-Command { python test_for.py }
TotalSeconds : 17.7107506
EDIT: the program that generates the random strings (for completeness):
$ cat main.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *fl = fopen("input.txt", "w+t");
    for (int i = 0; i < 10000000; i++)
    {
        int length = 5 + rand() % 50;
        for (int j = 0; j < length; j++)
            fprintf(fl, "%c", rand() % 26 + 'a');
        fprintf(fl, "\n");
    }
    fclose(fl);
    return 0;
}
First, your two benchmarks are performing different tests. The for-loop version matches a string against a list of strings by testing for equality. The any version uses the in operator to test whether one string is a substring of another, which is a more expensive operation than testing for equality. But even if we change the any benchmark to test for equality instead, which should make it run even faster, I find the results to be the opposite of what you found, i.e. using any runs more slowly than the for loop. In the following demo I have arranged it so that both benchmarks must iterate through all 10_000_000 strings, because a match will never be found:
import time

data = ['-' * 30 for _ in range(10_000_000)]

def test1():
    for d in data:
        if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
            return True
    return False

def test2():
    return any('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w for w in data)

t = time.time()
b = test1()
print(b, time.time() - t)
t = time.time()
b = test2()
print(b, time.time() - t)
Prints:
False 0.2701404094696045
False 0.5590765476226807
The second version, using any, runs twice as slowly (perhaps you should re-check your results). Why does the version using any actually run more slowly?
The argument to any is a generator expression, which results in the creation of a generator. So to test each value of the data, a function call into the generator must be made to produce the next comparison result, and it is that per-element function-call overhead that makes things run more slowly. We can turn your any version into one that uses an explicit generator function:
import time

data = ['-' * 30 for _ in range(10_000_000)]

def test1():
    for d in data:
        if 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == d:
            return True
    return False

def generator():
    for w in data:
        yield 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123' == w

def test2():
    return any(generator())

t = time.time()
b = test1()
print(b, time.time() - t)
t = time.time()
b = test2()
print(b, time.time() - t)
Prints:
False 0.28400230407714844
False 0.5430142879486084
As you can see, the time required for both any versions is essentially the same, as I would expect.
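One way to see where the extra work comes from (a side demonstration, not a timing benchmark; the helper names below are illustrative): the generator expression is compiled into its own nested code object, and a Python-level frame has to be resumed once per element, whereas the for loop's comparison runs inline in a single frame.

```python
import types

def with_any(data):
    # generator expression -> compiled into a separate code object,
    # resumed once per element of data
    return any('needle' == w for w in data)

def with_loop(data):
    # the comparison runs inline in this function's own frame
    for w in data:
        if 'needle' == w:
            return True
    return False

# The genexpr shows up as a nested code object among with_any's constants...
has_nested = any(isinstance(c, types.CodeType) for c in with_any.__code__.co_consts)
print('with_any has a nested code object:', has_nested)

# ...while with_loop has none.
has_nested_loop = any(isinstance(c, types.CodeType) for c in with_loop.__code__.co_consts)
print('with_loop has a nested code object:', has_nested_loop)
```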
Update: Using the OP's C Program to Generate the Data
Here I used the OP's C program to generate the data and I have used the same "xmuijdswly" string for the comparison:
import time

total_length = 0
data = []
with open('input.txt') as f:
    for line in f:
        line = line.strip()
        total_length += len(line)
        data.append(line)
print('Average string length:', total_length / len(data))

def test1():
    for d in data:
        if 'xmuijdswly' == d:
            return True
    return False

def test2():
    return any('xmuijdswly' == w for w in data)

t = time.time()
b = test1()
print(b, 'Time using for loop:', time.time() - t)
t = time.time()
b = test2()
print(b, 'Time using any:', time.time() - t)
Prints:
Average string length: 29.4972984
False Time using for loop: 0.3110032081604004
False Time using any: 0.6610157489776611
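As an aside (my addition, not one of the OP's benchmarks): if the goal is an exact-match membership test, the fastest pure-Python option is usually to hand the entire loop to the list's own in operator, which performs the equality scan in C with no per-element bytecode or generator resumption. A sketch using the same synthetic data as the earlier demos, shrunk for a quick run:

```python
import time

# Synthetic stand-in for the data, as in the demos above.
data = ['-' * 30 for _ in range(1_000_000)]

def test3():
    # list.__contains__ runs the whole equality scan in C
    return 'xmuijdswly' in data

t = time.time()
b = test3()
print(b, 'Time using list membership:', time.time() - t)
```

If the same list is searched repeatedly, converting it once to a set makes each subsequent lookup O(1) on average.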
One Possible Explanation for OP's Results
The OP hasn't actually posted the complete programs they were testing, which presumably have to first read in the data from input.txt. If the any version was run second, input.txt would already have been cached by the operating system as a result of the looping version having read it, and therefore the I/O time to read the input data would be much less for the any version.
Was the time to do that initial creation of the data list erroneously included in the benchmark timings?
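One way to rule both effects out is to time only the search itself, loading the data before any clock is started. A minimal harness sketch, using synthetic in-memory data so it runs standalone (with the real input.txt you would replace the list comprehension with the file-reading loop, still outside the timed region):

```python
import time

# Stand-in for the data from input.txt; building it happens before
# any clock starts, so file I/O and OS caching cannot skew the timings.
data = ['-' * 30 for _ in range(1_000_000)]

def timed(fn):
    # perf_counter is monotonic and intended for measuring intervals
    t0 = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - t0

def search_loop():
    for d in data:
        if d == 'xmuijdswly':
            return True
    return False

def search_any():
    return any('xmuijdswly' == w for w in data)

found_loop, loop_time = timed(search_loop)
found_any, any_time = timed(search_any)
print('loop:', found_loop, loop_time)
print('any: ', found_any, any_time)
```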