I'm trying to compare two large files: b.txt contains over 20 million lines, a.txt contains 50,000 lines. Comparing them the traditional way takes too much time. For example, the following code didn't finish in 5 hours:
b = [line.rstrip('\n') for line in open("b.txt")]
a = [line.rstrip('\n') for line in open("a.txt")]
for i in a:
    for j in b:
        if i in j:
            print(j)
I saw the following suggestion for a similar question on Stack Overflow:
with open('b.txt') as b:
    blines = set(b)
    print(blines[0])  # fails: a set doesn't support indexing
with open('a.txt') as a:
    for line in a:
        if line in blines:
            print(line)
The set() function works very fast and the code terminates in seconds. However, I need to find the exact string in blines which contains the line variable. Since a set is not accessible by index, I couldn't achieve it. Is there a way to find that matching string, or do you have any other suggestions to make this comparison faster than the first code?
From your comment, you say you need to handle (slightly edited for clarity):

line = 'abc' with blines = {'abcd', 'fg'}; if line in blines returns true, but I need the 'abcd' string, or its index, to find it
What you're doing can't be properly done with a set without combinatoric explosion. You want to handle arbitrary substrings, not just lines or words, which means blines would need to contain every substring of every line to make the lookup succeed (so instead of just 'abcd' you'd also need to store 'a', 'b', 'c', 'd', 'ab', 'bc', 'cd', 'abc', and 'bcd', and that's just for one short line, not 20M longer lines).
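To see the scale of the problem, here's a minimal sketch (the all_substrings helper is just for illustration) of what the set would have to hold for a single line:

def all_substrings(line):
    # every contiguous substring of the line
    return {line[i:j] for i in range(len(line)) for j in range(i + 1, len(line) + 1)}

print(sorted(all_substrings('abcd')))
# ['a', 'ab', 'abc', 'abcd', 'b', 'bc', 'bcd', 'c', 'cd', 'd']

A line of length n has n*(n+1)/2 substrings, so a single 60-character line would contribute 1,830 entries; multiply that by 20 million lines and the set approach is hopeless.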
A better solution is a data structure that lets you find all your target words in a given string without suffering combinatoric explosion, e.g. Aho-Corasick, for which a Python package, pyahocorasick, already exists and implements it efficiently.
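If you don't have the package yet, it's available on PyPI (note that the package is named pyahocorasick but imports as ahocorasick):

pip install pyahocorasick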
Instead of loading all your 20 million lines (and who knows how many substrings of each line) of b into memory, just build an automaton from the 50,000 strings in a, then check each line of b against that automaton:
import ahocorasick

# Build the automaton from the needles (the 50,000 strings in a.txt)
auto = ahocorasick.Automaton()
with open("a.txt") as needles:
    for needle in needles:
        needle = needle.rstrip('\n')
        auto.add_word(needle, needle)  # store the needle itself as the payload
auto.make_automaton()

# Scan each haystack line of b.txt in a single pass
with open("b.txt") as haystacks:
    for lineno, line in enumerate(haystacks):
        for end_ind, found in auto.iter(line):
            print(f"Found {found!r} on line {lineno} at index {end_ind - len(found) + 1}: {line!r}")
That makes a single Aho-Corasick automaton in O(n) time (relative to the size of a.txt), then scans it in O(n) time (relative to the size of b.txt); memory usage will be roughly proportional to the size of a.txt (from prior tests, for 50,000 random needles 3-12 characters long, memory usage for the automaton would likely be in the 10-20 MB range), and it doesn't suffer the combinatoric explosion of a set of all substrings. It will find all elements of a wherever they occur in b (even in the middle of a word, as in your example of needing to find abc in abcd) without additional memory overhead.
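A quick self-contained check of that mid-word behavior (a toy example, not your real data):

import ahocorasick

auto = ahocorasick.Automaton()
auto.add_word('abc', 'abc')
auto.make_automaton()
for end_ind, found in auto.iter('xxabcdxx'):
    print(found, end_ind)  # prints: abc 4 (the match ends at index 4)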
If you need to know the line number in a.txt, not just the line number in b.txt, just change the automaton building process to store the line number as well as the needle itself (you can associate anything with each word added, so a tuple is just as good as a str):
for lineno, needle in enumerate(needles):
    needle = needle.rstrip('\n')
    auto.add_word(needle, (lineno, needle))
then iterate later with:
for blineno, bline in enumerate(haystacks):
    for end_ind, (alineno, found) in auto.iter(bline):
adjusting the output as desired.
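For example (a sketch reusing the variable names above), the print inside the loop might become:

        print(f"Found {found!r} (a.txt line {alineno}) on b.txt line {blineno} at index {end_ind - len(found) + 1}")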