Tags: python, substring, string-matching

Python: optimal search for substring in list of strings


I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do:

listStrings = ['ACDE', 'CDDE', 'BPLL', ... ]

listSubstrings = ['ACD', 'BPI', 'KLJ', ...]

The above entries are just examples. len(listStrings) is ~ 60,000, len(listSubstrings) is ~50,000-300,000, and len(listStrings[i]) is anywhere from 10 to 30,000.

My current Python attempt is:

for i in listSubstrings:
    for j in listStrings:
        if i in j:
            w.write(i+j)

Or something along these lines. While this works, it's horribly slow: it uses only one core and takes on the order of 40 minutes to complete the task. Is there a way to speed this up?

I don't believe that I can make a dict out of listStrings:listSubstrings because there is the possibility of duplicate entries which need to be stored on both ends (although I may try this if I can find a way to append a unique tag to each one, since dicts are so much faster). Similarly, I don't think I can pre-compute possible substrings. I don't even know if searching dict keys is faster than searching a list (since dict.get() is going to give the specific input and not look for sub-inputs). Is searching lists in memory just that slow relatively speaking?
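
For what it's worth, one way to give each entry a unique tag would be to pair it with its index via enumerate; a rough sketch of that idea (the names are just illustrative):

# every key is unique even when the same string appears more than once
taggedStrings = {idx: s for idx, s in enumerate(listStrings)}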


Solution

  • Maybe you can try to chunk one of the two lists (the biggest one? Although intuitively I would cut listStrings) into smaller pieces, then use multiprocessing to run these searches in parallel (the Pool class from multiprocessing offers a convenient way to do this). I had a significant speed-up using something like this:

    from multiprocessing import Pool
    from itertools import chain, islice
    
    # The function to be run in parallel: for each string in the chunk,
    # return one "substring+string" entry per substring it contains
    # (mirrors the question's w.write(i+j))
    def my_func(strings):
        return [j+i for i in strings for j in listSubstrings if j in i]
    
    # A small recipe built on itertools to split an iterable into fixed-size chunks:
    def chunk(it, size):
        it = iter(it)
        return iter(lambda: tuple(islice(it, size)), ())
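    # e.g. list(chunk(range(10), 4)) == [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]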
    
    # Generate some fake random data (strings of uppercase letters):
    from random import randint
    listStrings = \
        [''.join([chr(randint(65, 90)) for i in range(randint(1, 500))]) for j in range(10000)]
    listSubstrings = \
        [''.join([chr(randint(65, 90)) for i in range(randint(1, 100))]) for j in range(1000)]
    
    # Prepare the work: split listStrings into roughly 8 chunks so the
    # pool can spread them across its workers
    prep = list(chunk(listStrings, round(len(listStrings) / 8)))
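    # Note (assumption about your platform): with the "spawn" start method
    # (Windows, newer macOS), the Pool section below should be wrapped in an
    # `if __name__ == '__main__':` guard so workers don't re-execute it on import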
    with Pool(4) as mp_pool:
        # Pool.map is a parallel version of the built-in map()
        res = mp_pool.map(my_func, prep)
    # `res` is a list of lists (one per chunk), so now you concatenate them
    # in order to have a flat result list
    result = list(chain.from_iterable(res))
    

    Then you can write the whole result variable at once (instead of writing it line by line):

    with open('result_file', 'w') as f:
        f.write('\n'.join(result))
    

    Edit 01/05/18: flattened the result using itertools.chain.from_iterable instead of an ugly workaround relying on map side-effects, following ShadowRanger's advice.
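
    For reference, chain.from_iterable flattens exactly one level of nesting, which is what the list of lists returned by Pool.map needs:

    from itertools import chain
    nested = [[1, 2], [3], [4, 5]]
    print(list(chain.from_iterable(nested)))  # [1, 2, 3, 4, 5]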