
Efficiently find matching string from substrings in large lists


I have two lists containing about 5 million items each. List_1 is a list of tuples, with two strings per tuple. List_2 is a long list of strings.

I am trying to find a compound string, made from those tuples, in List_2.

So if the tuple from List_1 is ("foo", "bar"), and List_2 contains ["flub", "blub", "barstool", "foo & bar: misadventures in python"], I would be trying to fetch "foo & bar: misadventures in python" from List_2.

The way I currently do it is by iterating through List_1 and using a generator expression to scan through List_2. While each search through List_2 is fast, taking about a second to execute, it has to be repeated for every item in List_1, and therefore requires an inordinate amount of time (the better part of 1000 hours) to complete, which made me wonder if there was a faster, more efficient way to do the same thing.

Code Example:

list_1 = [] #Insert list of (string, string) tuples
list_2 = [] #Insert list of strings

for search_term in list_1:
    compound_string = "{search_first} & {search_second}".format(search_first=search_term[0], search_second=search_term[1])
    #Short-circuit, so we don't need to search through the whole list
    result = next((s for s in list_2 if compound_string in s), None)
    if result:
        pass #do exciting things

I looked into using sets and intersection to perform the comparison; however, a set intersection only matches whole strings. As I do not know the whole string ahead of time, that method doesn't seem feasible without falling back to a for loop over lists, which would run into the same problem.
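
For example (a minimal sketch of why the set approach falls short; the list names are the placeholders from above):

compound_strings = {"{} & {}".format(a, b) for a, b in list_1}
exact_matches = set(list_2) & compound_strings
#"foo & bar" != "foo & bar: misadventures in python", so the intersection
#misses it; intersection tests equality, not substring containment.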


Solution

  • The problem seems somewhat under-specified, but also over-specified. For example, what do tuples really have to do with it?

    I'll presume to reframe it like so: you have a list of strings, needles, and another list of strings, haystacks. You want to find all the haystacks that contain a (at least one) needle.

    First thing that comes to mind then is to preprocess the needles, to build a trie structure that allows searching for any of them more efficiently. Then march over the haystacks, one at a time, using that structure to test them.

    Here's some simple code off the top of my head. It doesn't sound like RAM will be a problem for you, but if it is, there are fancier ways to build "compressed" tries. BTW, if it's the case that all your needles contain the 3-character substring " & ", then my best guess is that most haystacks won't, so you could get out cheap in most cases by checking for just that much first.

    from collections import defaultdict
    
    class Node:
        __slots__ = 'final', 'ch2node'
    
        def __init__(self):
            self.final = False
            self.ch2node = defaultdict(Node)
    
    # Add string `s` to the trie, marking its last node as final.
    def add(trie, s):
        for ch in s:
            trie = trie.ch2node[ch]
        trie.final = True
    
    # Does s[i:] start with a string in the trie?
    def find(trie, s, i):
        for i in range(i, len(s)):
            if trie.final:
                return True
            ch = s[i]
            if ch in trie.ch2node:
                trie = trie.ch2node[ch]
            else:
                return False
        return trie.final
    
    # Does any string in the trie appear anywhere in `s`?
    def search(trie, s):
        return any(find(trie, s, i) for i in range(len(s)))
    
    needles = ["a & b", "kik", "c & as"]
    haystacks = ["sldjkfa & b", "c&as", "akiko", "xc & asx", "kkc & a"]
    
    root = Node()
    for n in needles:
        add(root, n)
    print(list(h for h in haystacks if search(root, h)))
    

    Which prints

    ['sldjkfa & b', 'akiko', 'xc & asx']

    EDIT

    A comment mentioned the Aho-Corasick algorithm, which is roughly related to the simple trie code above, but fancier and more efficient (it effectively searches "everywhere in the haystack simultaneously").

    I haven't yet used it, but there's what looks like a capable Python package for that available on PyPI.
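
    If that package is pyahocorasick (my assumption; it isn't named above), usage looks roughly like the following sketch. The list names are placeholders.

    import ahocorasick  # the pyahocorasick package

    automaton = ahocorasick.Automaton()
    for needle in needles:
        automaton.add_word(needle, needle)  # store the needle itself as the value
    automaton.make_automaton()

    # automaton.iter(haystack) yields (end_index, value) for every needle found
    hits = [h for h in haystacks if next(automaton.iter(h), None) is not None]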

    EDIT2

    I'm trying to get you unstuck, not give you a theoretically optimal solution. Try stuff! You may be surprised at how well even the simple code I gave may work for you.

    I fiddled the above to create 5 million "needles", each composed of 2 dictionary words (each at least 10 letters) separated by a single space. Building the trie took under 45 seconds (Python 3.12.4). Checking 5_008_510 lines against them took another 55 seconds, well under 2 minutes from start to finish. Contrast with "the better part of 1000 hours" you think you're facing now.

    This was with no attempt to "optimize" anything, beyond just using a dirt-simple trie.

    If I were to pursue it, I'd look first at memory use rather than speed. This consumed about 8.2GB of peak RAM. One way to cut that is to post-process the trie, to delete the empty dict on final nodes (or to not allocate a dict at all unless it's needed). But that would complicate the code some. Another is to look at using byte strings instead of Unicode strings. Then there are gonzo absurdities, like not using a Node class at all, and instead using raw 2-tuples or 2-lists.

    But given all you said about your problem, it would be "good enough for me" already.

    TRADEOFFS

    The great advantage of a trie is that it's insensitive to how many needles there are - the time to check a haystack is about the same whether there's one or a billion needles to look for.

    The great potential disadvantage is the memory needed to hold a needle trie. 5 million needles is certainly on the large side, which is why I used as simple a trie structure as possible.

    The tradeoff there is that for a haystack of length L, it may need to do L distinct searches. The related Aho-Corasick automaton only needs to do one search, regardless of how large L is. But that's a fancier trie variant that requires more memory and hairier code.

    In the absence of any info about the distribution of your haystack (or even needle) sizes, "first try the simplest thing that could possibly work" rules. The potential quadratic (in L) time nature of the dirt-simple-trie search would kill it if, e.g., L could be as large as a million - but is a relatively minor thing if L won't get bigger than, say, 100 (being 100 times slower than theoretically necessary just doesn't matter much compared to saving a factor of 5 million).

    Sketching all the possible tradeoffs would require a small book. To get more focused suggestions, you need to give quantified details about your expected needles and haystacks.

    In case it's not obvious, here's a pragmatic thing: if the RAM for a needle trie is "too large", you can shave it by about a factor of K, by doing K runs, using only len(needles)/K needles per run. In that case, the needles should be sorted first (common prefixes are physically shared in a trie, and sorting will bring needles with common prefixes together).
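
    A rough sketch of that idea, reusing Node/add/search from the first code block (needles, haystacks, and K here stand for your real data and pass count):

    K = 4  # number of passes; pick it so one trie chunk fits in RAM
    needles.sort()  # needles sharing prefixes end up in the same chunk
    chunk = (len(needles) + K - 1) // K
    for start in range(0, len(needles), chunk):
        root = Node()
        for n in needles[start : start + chunk]:
            add(root, n)
        for h in haystacks:
            if search(root, h):
                pass  # record the hit; haystacks are rescanned once per pass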

    Or you can do a lot more work to build a disk-based trie.

    The possible solution space is large.

    QUICK COMPARISON

    As above, 5 million needles, but I cut their average size about in half, from around 21 characters to about 10.5 (otherwise the other package ran into RAM pressure). Somewhat over 5 million haystacks. Most had lengths under 70, but a few were in the hundreds. Few haystacks contained a needle (only 910 total).

    For other code, I used ahocorapy, a pure-Python implementation of full-blown Aho-Corasick.
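
    Its use looks roughly like this (a sketch from memory; check ahocorapy's README for the exact details):

    from ahocorapy.keywordtree import KeywordTree

    kwtree = KeywordTree()
    for needle in needles:
        kwtree.add(needle)
    kwtree.finalize()

    # search_all() generates (needle, start_index) pairs for needles found in the haystack
    for needle, start in kwtree.search_all(haystack):
        print(needle, start)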

    Memory use for the other package was significantly higher. Expected. Where my Node class contains only 2 members, its similar State class contains 7. It needs to save away a lot more info to guarantee worst-case linear-time performance.

    For the same reason, building the needle trie was also slower. About 24 seconds for my code, about 60 for ahocorapy.

    But you get what you pay for ;-) Searching the 5M+ haystacks took about 55 seconds for my code, but only about 22 for ahocorapy. Since needles were rarely found, this is close to a worst case for my code (it has to try len(haystack) distinct searches to conclude that no needles are present).

    In all, my code was slightly faster overall, thanks to it doing much less work to build its dirt-dumb needle trie to begin with.

    Under the PyPy implementation of Python, all this code runs at least twice as fast.

    And with either code base, pickle could be used to save away the needle trie for reuse on a different collection of haystacks.
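
    For example (a sketch; the filename is arbitrary, `root` is the trie root from the code above, and the Node class must be importable when loading):

    import pickle

    with open("needle_trie.pkl", "wb") as f:
        pickle.dump(root, f, protocol=pickle.HIGHEST_PROTOCOL)

    # ... later, for a different collection of haystacks ...
    with open("needle_trie.pkl", "rb") as f:
        root = pickle.load(f)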

    ANOTHER STAB

    Here's new "dirt dumb" code, running faster, using less memory, better structured, and generalized to give the possibility of finding all contained needles. For terrible cases, consider a set of needles like

    x
    ax
    aax
    aaax
    aaaax
    aaaaax
    ...
    aaaaa.....aaax
    

    and a haystack like aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaay. There are no matches, but one-at-a-time searching will take a long time. Full Aho-Corasick is much faster.

    from collections import defaultdict
    
    class Node:
        __slots__ = 'final', 'ch2node'
    
        def __init__(self):
            self.final = ""
            self.ch2node = None
    
    class Trie:
        def __init__(self):
            self.root = Node()
            self.leaders = set()
    
        # Add a needle.
        def add(self, s):
            if not s:
                raise ValueError("empty string not allowed")
            trie = self.root
            for ch in s:
                if (ch2node := trie.ch2node) is None:
                    ch2node = trie.ch2node = defaultdict(Node)
                trie = ch2node[ch]
            trie.final = s
            self.leaders.add(s[0])
    
        # Search a haystack `s`. Generates a 2-tuple, (needle, i), for each needle the
        # haystack contains. The needle starts at s[i].
        def search(self, s):
            leaders = self.leaders
            root = self.root
            for i, ch in enumerate(s):
                if ch not in leaders:
                    continue
                trie = root
                for i in range(i, len(s)):
                    if ((ch2node := trie.ch2node)
                          and (ch := s[i]) in ch2node):
                        trie = ch2node[ch]
                        if trie.final:
                            yield trie.final, i - len(trie.final) + 1
                    else:
                        break
    

    Then, e.g.,

    t = Trie()
    for n in 'eat', 'tea', 'car', 'care', 'cares', 'arc':
        t.add(n)
    
    for m in t.search('eateacarcares'):
        print(m)
    

    displays:

    ('eat', 0)
    ('tea', 2)
    ('car', 5)
    ('arc', 6)
    ('car', 8)
    ('care', 8)
    ('cares', 8)
    

    RABIN-KARP

    A different approach is to use the Rabin-Karp (RK) algorithm to match multiple needles. A particular hash code is precomputed for each needle, and all the storage needed is a dict mapping a hash code to a list of the needles with that hash code.

    Searching a haystack is just a matter of going over it once, left to right, computing RK's "rolling hash" for each window. The hash function is designed so that the next window's hash can be computed using a small & fixed number (independent of window size) of arithmetic operations. If the hash is in the dict, directly compare each needle with that hash against the haystack window.

    It's simple, fast, and very memory-frugal. BUT. Alas, it's only straightforward if all the needles have the same length, so that the window size is fixed. If there are K different needle lengths, it gets messy. You can, e.g., use K different rolling hashes, but that slows it by a factor of K. Or you can use a window size equal to the length of the shortest needle, but then the number of false positives can zoom.

    I used the latter strategy, and couldn't make it competitive under CPython. However, PyPy excels at speeding Python code doing arithmetic on native machine-size ints, and speed was comparable under that. Memory use was 10x smaller, a massive win. RK is also good at avoiding pathological cases, although that's probabilistic, not guaranteed.
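
    To make "rolling hash" concrete, here's a minimal sketch of one classic polynomial scheme (not the code I timed; the base and modulus are arbitrary choices):

    def rolling_hashes(s, w, base=257, mod=(1 << 61) - 1):
        # Yield (i, hash of s[i:i+w]) for every length-w window of s,
        # deriving each hash from the previous one with O(1) arithmetic.
        if w > len(s):
            return
        top = pow(base, w - 1, mod)  # weight of the character leaving the window
        h = 0
        for ch in s[:w]:
            h = (h * base + ord(ch)) % mod
        yield 0, h
        for i in range(1, len(s) - w + 1):
            h = ((h - ord(s[i - 1]) * top) * base + ord(s[i + w - 1])) % mod
            yield i, h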

    FOR SHORT--BUT NOT TOO SHORT--NEEDLES

    Now that we know your needles are neither tiny nor huge, this should yield a nice speedup, and requires vastly less memory:

    EDIT: added leaders, which gives a highly significant speedup on my data.

    EDIT: in searching, .startswith() avoids needing to construct a new string object.

    class RK: # not Rabin-Karp, but inspired by it
        def __init__(self):
            from collections import defaultdict
            self.needles = []
            self.h2ns = defaultdict(list)
            self.leaders = set()
    
        def add(self, s):
            if not s:
                raise ValueError("empty string not allowed")
            if self.h2ns:
                raise ValueError("cannot add() after finalize() called")
            self.needles.append(s)
            self.leaders.add(s[0])
    
        def finalize(self):
            h2ns = self.h2ns
            if h2ns:
                return # already finalized
            w = self.w = min(map(len, self.needles))
            for n in self.needles:
                h2ns[hash(n[:w])].append(n)
            del self.needles
    
        def search(self, s):
            if not self.h2ns:
                raise ValueError("must call finalize() before search()")
            h2nsget = self.h2ns.get
            w = self.w
            leaders = self.leaders
            for i in range(len(s) - w + 1):
                if (s[i] in leaders
                      and (ns := h2nsget(hash(s[i : i + w])))):
                    for n in ns:
                        if s.startswith(n, i):
                            yield n, i
    

    Then, e.g.,

    t = RK()
    for n in 'eat', 'tea', 'car', 'care', 'cares', 'arc', 's':
        t.add(n)
    t.finalize()
    
    for m in t.search('eateacarcares'):
        print(m)
    

    prints

    ('eat', 0)
    ('tea', 2)
    ('car', 5)
    ('arc', 6)
    ('car', 8)
    ('care', 8)
    ('cares', 8)
    ('s', 12)
    

    The searching part isn't actually faster, but finalize() is very much faster than building a trie. The length of the shortest needle is vital: the shorter it is, the more likely searching will have to weed out "false positives".

    On my same test data, total time from start to finish is about 15 seconds with this.

    Variant: why hash at all? h2ns could instead map a w-character string to the needles starting with that string. Hashing would still occur, of course, but under the covers, as part of Python doing the dict lookup. It makes little difference to speed, but boosts memory use (w-character dict keys require more space than machine int keys). That in turn can be reduced by storing needle[w:] in the lists instead, for a possible reduction in total character storage needed. But that makes finalize() slower - slicing isn't free. All such variants look "good enough" to me.
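
    A sketch of that variant, as a subclass of the RK class above (only finalize() and search() change):

    class RK2(RK):  # key on the w-character prefix itself, not its hash
        def finalize(self):
            h2ns = self.h2ns
            if h2ns:
                return  # already finalized
            w = self.w = min(map(len, self.needles))
            for n in self.needles:
                h2ns[n[:w]].append(n)
            del self.needles

        def search(self, s):
            if not self.h2ns:
                raise ValueError("must call finalize() before search()")
            h2nsget = self.h2ns.get
            w = self.w
            leaders = self.leaders
            for i in range(len(s) - w + 1):
                if s[i] in leaders and (ns := h2nsget(s[i : i + w])):
                    for n in ns:
                        if s.startswith(n, i):
                            yield n, i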