I am writing some function that has nested-loops and it goes really slow when big lists are involved.
def get_resolved(urllist, generated_urls, layout):
result = {}
for url in urllist:
tmp_result = []
for gurl in generated_urls[url]:
if gurl in resolved[layout]:
tmp_result.append(gurl)
result[url] = tmp_result
return result
The I have three lists in this function, a list urllist with about 5000 domain names, a generated_urls list with about 500 000 items which is also just text and then the third list resolved[layout]. This last list comes out of a global dictionary resolved. This one also contains on average 10 000 items.
I want to return a result dictionary which only contains the items out of generated_urls for that specific url that is also in the resolved[layout] list.
The problem is that this nested-loops takes about an hour to execute. This is to slow, because I have to do this for about 30 times or something. I don't see how to make this more performant. Does anyone know how I could do this?
I also run cProfile on this script and this made me see that it was the script above that is so slow. This is the top part of the output:
Sat Nov 29 17:09:10 2014 profile_difflayouts
2684341 function calls (2684295 primitive calls) in 101.069 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.006 0.006 101.069 101.069 DiffLayouts.py:1(<module>)
1 0.001 0.001 101.055 101.055 DiffLayouts.py:13(main)
18 0.001 0.000 95.898 5.328 DiffLayouts.py:62(process_data)
36 95.712 2.659 95.712 2.659 DiffLayouts.py:149(get_resolved)
1 0.001 0.001 79.703 79.703 DiffLayouts.py:30(check_alexa_list_single)
1 0.000 0.000 16.198 16.198 DiffLayouts.py:42(check_alexa_list_combined)
3 0.950 0.317 5.152 1.717 DiffLayouts.py:136(filter_domainnames)
1017314 2.182 0.000 2.182 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
775796 1.561 0.000 1.561 0.000 {method 'findall' of '_sre.SRE_Pattern' objects}
75 0.240 0.003 0.240 0.003 {method 'read' of 'file' objects}
75 0.115 0.002 0.115 0.002 {method 'splitlines' of 'str' objects}
This is actually with some new code, I already tried. With list comprehension, but this only gives me a very small performance gain of about 0,5 %. New version:
def get_resolved(urllist, generated_urls, layout):
result = {}
for url in urllist:
result[url] = [x for x in generated_urls[url] if x in resolved[layout]]
return result
I hope this is explained enough. Just ask if you don't understand what I'm trying to do here.
Thank you
It seems from the profile that you are spending all your time checking for membership of an element with if x in resolved[layout]
.
Now a list is not the most efficient way to store an immutable set of object which just need to support search. Use a set instead. Consider this micro-benchmark:
import random
import time
import sys
size_url = 10000
size_resolved = 10000
random.seed(time.time())
url = [ random.randint(1,sys.maxint) for x in xrange(size_url)]
resolved = [ random.randint(1,sys.maxint) for x in xrange(size_resolved)]
a = time.time()
intersection = [ x for x in url if x in resolved ]
print "Search in list:",time.time() - a
resolved = set(resolved)
a = time.time()
intersection = [ x for x in url if x in resolved ]
print "Search in set:",time.time() - a
This is the output I get on my laptop:
Search in list: 1.89044713974
Search in set: 0.00117897987366
Therefore, modify your code in the following way:
def get_resolved(urllist, generated_urls, layout):
result = {}
for url in urllist:
result[url] = [x for x in generated_urls[url] if x in set(resolved[layout])]
return result