Tags: performance, python-2.7, nested-loops

Make algorithm faster that runs over multiple lists


I am writing a function with nested loops, and it runs really slowly when big lists are involved.

def get_resolved(urllist, generated_urls, layout):
    result = {}
    for url in urllist:
        tmp_result = []
        for gurl in generated_urls[url]:
            if gurl in resolved[layout]:
                tmp_result.append(gurl)
        result[url] = tmp_result
    return result

I have three collections in this function: urllist, a list of about 5,000 domain names; generated_urls, which holds about 500,000 items that are also just text and is looked up per url; and resolved[layout]. This last list comes out of a global dictionary resolved and contains on average 10,000 items.

I want to return a result dictionary that maps each url to only those items of generated_urls[url] that are also in the resolved[layout] list.

The problem is that these nested loops take about an hour to execute. This is too slow, because I have to run this about 30 times. I don't see how to make it faster. Does anyone know how I could do this?

I also ran cProfile on this script, which showed that the function above is the slow part. This is the top part of the output:

Sat Nov 29 17:09:10 2014    profile_difflayouts

         2684341 function calls (2684295 primitive calls) in 101.069 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.006    0.006  101.069  101.069 DiffLayouts.py:1(<module>)
        1    0.001    0.001  101.055  101.055 DiffLayouts.py:13(main)
       18    0.001    0.000   95.898    5.328 DiffLayouts.py:62(process_data)
       36   95.712    2.659   95.712    2.659 DiffLayouts.py:149(get_resolved)
        1    0.001    0.001   79.703   79.703 DiffLayouts.py:30(check_alexa_list_single)
        1    0.000    0.000   16.198   16.198 DiffLayouts.py:42(check_alexa_list_combined)
        3    0.950    0.317    5.152    1.717 DiffLayouts.py:136(filter_domainnames)
  1017314    2.182    0.000    2.182    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   775796    1.561    0.000    1.561    0.000 {method 'findall' of '_sre.SRE_Pattern' objects}
       75    0.240    0.003    0.240    0.003 {method 'read' of 'file' objects}
       75    0.115    0.002    0.115    0.002 {method 'splitlines' of 'str' objects}

This profile is actually from some new code I already tried, using a list comprehension, but it only gives a very small performance gain of about 0.5%. New version:

def get_resolved(urllist, generated_urls, layout):
    result = {}
    for url in urllist:
        result[url] = [x for x in generated_urls[url] if x in resolved[layout]]
    return result

I hope this is explained well enough. Just ask if you don't understand what I'm trying to do here.

Thank you


Solution

  • The profile shows that almost all of the time (about 95.7 of 101 seconds) is spent inside get_resolved, that is, checking membership with if x in resolved[layout].

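    A list is not an efficient way to store an unchanging collection of objects that only needs to support lookups: each membership test scans the list linearly, whereas a set uses hashing and answers in roughly constant time on average. Use a set instead. Consider this micro-benchmark: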

    import random
    import time
    import sys
    
    size_url      = 10000
    size_resolved = 10000
    
    random.seed(time.time()) 
    
    url = [ random.randint(1,sys.maxint) for x in xrange(size_url)]
    resolved = [ random.randint(1,sys.maxint) for x in xrange(size_resolved)]
    
    a = time.time()
    intersection = [ x for x in url if x in resolved ]
    print "Search in list:",time.time() - a
    
    resolved = set(resolved)
    
    a = time.time()
    intersection = [ x for x in url if x in resolved ]
    print "Search in set:",time.time() - a
    

    This is the output I get on my laptop:

    Search in list: 1.89044713974
    Search in set: 0.00117897987366
    

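    Therefore, modify your code as follows, building the set once outside the loop so that it is not re-created for every element:

    def get_resolved(urllist, generated_urls, layout):
        # Build the set once; calling set() inside the comprehension
        # would rebuild it for every element tested.
        resolved_set = set(resolved[layout])
        result = {}
        for url in urllist:
            result[url] = [x for x in generated_urls[url] if x in resolved_set]
        return result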
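
Since the profile shows get_resolved being called 36 times, a further option is to convert every list in the global resolved dictionary to a set once, before any call. Below is a minimal sketch of that idea, assuming resolved maps layout names to lists of strings and generated_urls maps each url to a list of candidates. The sample data, the layout name "qwerty" and the name resolved_sets are made up for illustration and are not from the original script.

```python
# Hypothetical sample data; the real structures come from the asker's script.
resolved = {
    "qwerty": ["exampel.com", "examplr.com", "fooo.org"],
}
generated_urls = {
    "example.com": ["exampel.com", "exmaple.com", "examplr.com"],
    "foo.org": ["fooo.org", "foo0.org"],
}

# Convert every list to a set once, before any call to get_resolved.
# resolved_sets is an illustrative name, not part of the original code.
resolved_sets = dict((layout, set(urls)) for layout, urls in resolved.items())


def get_resolved(urllist, generated_urls, layout):
    # Fetch the prebuilt set once per call; each "in" test is then a hash lookup.
    lookup = resolved_sets[layout]
    result = {}
    for url in urllist:
        result[url] = [x for x in generated_urls[url] if x in lookup]
    return result


print(get_resolved(["example.com", "foo.org"], generated_urls, "qwerty"))
```

For the sample data this keeps exampel.com and examplr.com for example.com, and fooo.org for foo.org. Whether the up-front conversion pays off depends on how often resolved changes between calls; if it changes, the sets would need to be rebuilt.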