Make algorithm faster that runs over multiple lists

I am writing some function that has nested-loops and it goes really slow when big lists are involved.

def get_resolved(urllist, generated_urls, layout):
    result = {}
    for url in urllist:
        tmp_result = []
        for gurl in generated_urls[url]:
            if gurl in resolved[layout]:
                tmp_result.append(gurl)
        result[url] = tmp_result
    return result

The I have three lists in this function, a list urllist with about 5000 domain names, a generated_urls list with about 500 000 items which is also just text and then the third list resolved[layout]. This last list comes out of a global dictionary resolved. This one also contains on average 10 000 items.

I want to return a result dictionary which only contains the items out of generated_urls for that specific url that is also in the resolved[layout] list.

The problem is that this nested-loops takes about an hour to execute. This is to slow, because I have to do this for about 30 times or something. I don't see how to make this more performant. Does anyone know how I could do this?

I also run cProfile on this script and this made me see that it was the script above that is so slow. This is the top part of the output:

Sat Nov 29 17:09:10 2014    profile_difflayouts

         2684341 function calls (2684295 primitive calls) in 101.069 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006  101.069  101.069 DiffLayouts.py:1(<module>)
        1    0.001    0.001  101.055  101.055 DiffLayouts.py:13(main)
       18    0.001    0.000   95.898    5.328 DiffLayouts.py:62(process_data)
       36   95.712    2.659   95.712    2.659 DiffLayouts.py:149(get_resolved)
        1    0.001    0.001   79.703   79.703 DiffLayouts.py:30(check_alexa_list_single)
        1    0.000    0.000   16.198   16.198 DiffLayouts.py:42(check_alexa_list_combined)
        3    0.950    0.317    5.152    1.717 DiffLayouts.py:136(filter_domainnames)
  1017314    2.182    0.000    2.182    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   775796    1.561    0.000    1.561    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}
       75    0.240    0.003    0.240    0.003 {method 'read' of 'file' objects}
       75    0.115    0.002    0.115    0.002 {method 'splitlines' of 'str' objects}

This is actually with some new code, I already tried. With list comprehension, but this only gives me a very small performance gain of about 0,5 %. New version:

def get_resolved(urllist, generated_urls, layout):
    result = {}
    for url in urllist:
        result[url] = [x for x in generated_urls[url] if x in resolved[layout]]
    return result

I hope this is explained enough. Just ask if you don't understand what I'm trying to do here.

Thank you

Solution

It seems from the profile that you are spending all your time checking for membership of an element with if x in resolved[layout].

Now a list is not the most efficient way to store an immutable set of object which just need to support search. Use a set instead. Consider this micro-benchmark:

import random
import time
import sys

size_url      = 10000
size_resolved = 10000

random.seed(time.time()) 

url = [ random.randint(1,sys.maxint) for x in xrange(size_url)]
resolved = [ random.randint(1,sys.maxint) for x in xrange(size_resolved)]

a = time.time()
intersection = [ x for x in url if x in resolved ]
print "Search in list:",time.time() - a

resolved = set(resolved)

a = time.time()
intersection = [ x for x in url if x in resolved ]
print "Search in set:",time.time() - a

This is the output I get on my laptop:

Search in list: 1.89044713974
Search in set: 0.00117897987366

Therefore, modify your code in the following way:

def get_resolved(urllist, generated_urls, layout):
    result = {}
    for url in urllist:
        result[url] = [x for x in generated_urls[url] if x in set(resolved[layout])]
    return result