Tags: python, substring, string-matching

Finding matched string in a set


I'm trying to compare two large files: b.txt contains over 20 million lines, a.txt contains 50,000 lines. Comparing them in the traditional way takes too much time. For example, the following code didn't finish in 5 hours:

b = [line.rstrip('\n') for line in open("b.txt")]
a = [line.rstrip('\n') for line in open("a.txt")]
for i in a:
    for j in b:
        if i in j:
            print(j)

I saw the following suggestion for a similar question on Stack Overflow:

with open('b.txt') as b:
    blines = set(b)
with open('a.txt') as a:
    for line in a:
        if line in blines:
            print(line)

The set() lookup works very fast and the code terminates in seconds. However, I need to find the exact string in blines which contains the line variable. Since a set is not accessible by index, I couldn't achieve it. Is there a way to find that matching string, or do you have any other suggestion to make this comparison faster than the first code?
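To make the limitation concrete, here is a minimal sketch (with made-up stand-in data) of why set membership alone can't answer this: `in` on a set tests whole elements, never substrings.

```python
blines = {"abcd", "fg"}  # stand-in for the lines of b.txt

print("abcd" in blines)  # True: the whole line is an element of the set
print("abc" in blines)   # False: membership never matches a substring
```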


Solution

  • From your comment, you say you need to handle (slightly edited for clarity):

    line = 'abc' with blines = {'abcd', 'fg'}; if line in blines returns True, but I need the 'abcd' string, or its index, to find it

    What you're doing can't properly be done with a set without combinatoric explosion. You want to handle arbitrary substrings, not just lines or words, which means blines would need to contain every substring of every line for the lookup to succeed (so instead of just 'abcd' you'd also need to store 'a', 'b', 'c', 'd', 'ab', 'bc', 'cd', 'abc', and 'bcd', and that's just for one short line, not 20M longer lines).
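    A rough illustration of how fast that explosion happens, using a hypothetical `all_substrings` helper (a line of length n has n*(n+1)/2 slices, so the substring set blows up long before you reach 20M lines):

    ```python
    def all_substrings(s):
        # every non-empty slice s[i:j]
        return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

    print(len(all_substrings("abcd")))  # 10: a, b, c, d, ab, bc, cd, abc, bcd, abcd
    n = 80
    print(n * (n + 1) // 2)             # 3240 slices for a single 80-character line
    ```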

    A better solution is a data structure that lets you find all your target words in a given string without suffering combinatoric explosion, e.g. Aho-Corasick; a Python package, pyahocorasick, already exists that implements it efficiently.

    Instead of loading all your 20 million lines (and who knows how many substrings of each line) of b into memory, just build an automaton from the 50,000 strings in a, then check each line of b against that automaton:

    import ahocorasick

    # Build the automaton from the 50,000 needles in a.txt
    auto = ahocorasick.Automaton()
    with open("a.txt") as needles:
        for needle in needles:
            needle = needle.rstrip('\n')
            if needle:  # skip blank lines; empty words can't be matched
                auto.add_word(needle, needle)  # store the needle as its own value
    auto.make_automaton()

    # Stream b.txt and report every needle found in each line
    with open("b.txt") as haystacks:
        for lineno, line in enumerate(haystacks):
            for end_ind, found in auto.iter(line):
                print(f"Found {found!r} on line {lineno} at index {end_ind - len(found) + 1}: {line!r}")

    That builds a single Aho-Corasick automaton in O(n) time (relative to the size of a.txt), then scans in O(n) time (relative to the size of b.txt). Memory usage is roughly proportional to the size of a.txt; from prior tests with 50,000 random needles 3-12 characters long, the automaton would likely take somewhere in the 10-20 MB range. It doesn't suffer the combinatoric explosion of a set of all substrings, and it finds every element of a wherever it occurs in b (even in the middle of a word, as in your example of finding abc inside abcd) without additional memory overhead.

    If you need to know the line number in a.txt, not just the line number in b.txt, just change the automaton building process to store the line number as well as the needle itself (you can associate anything with each word added, so a tuple is just as good as a str):

    for lineno, needle in enumerate(needles):
        needle = needle.rstrip('\n')
        auto.add_word(needle, (lineno, needle))
    

    then iterate later with:

    for blineno, bline in enumerate(haystacks):
        for end_ind, (alineno, found) in auto.iter(bline):
    

    adjusting the output as desired.
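    If installing pyahocorasick is not an option, a dependency-free sketch of the same idea uses a compiled `re` alternation of the escaped needles (the in-memory sample data here is an assumption standing in for the two files). Note that `finditer` reports non-overlapping matches only, and sorting the needles longest-first makes overlapping alternatives prefer the longest match; for 50,000 needles this is typically slower than Aho-Corasick, but still far better than the nested loops:

    ```python
    import re

    a_lines = ["abc", "fg"]            # stand-in for the stripped lines of a.txt
    b_lines = ["xxabcdxx", "nothing"]  # stand-in for the lines of b.txt

    # Longest-first so overlapping needles prefer the longest match
    pattern = re.compile("|".join(
        re.escape(n) for n in sorted(a_lines, key=len, reverse=True)))

    for blineno, bline in enumerate(b_lines):
        for m in pattern.finditer(bline):
            print(f"Found {m.group()!r} on line {blineno} at index {m.start()}")
    ```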