I've tried to find the correct code and have written multiple attempts at this myself but am hoping that someone here can help.
I have a small file and a large file: Small: 90,000 lines; Large: 1,200,000 lines.
I'm trying to print any line from the large file to a result file if that line contains any line from the small file. I should also add that everything should be compared in lowercase, as case should not matter.
Example:
Large file line: '{"data": "SN@StackOverflow"}'
Small file line (ANY): '@stackoVerfloW'
-> Prints large line.
This should yield a print of the large line to the result file. Note that the small file's line would be a subset of the large file's line. This is the goal, as I am not looking for matching lines but rather lines that contain data from the small file, and then saving those large-file lines as results.
I've tried to do this with a nested for loop, however it is not printing any of the results. I also loaded the files in as lists, hoping to reduce time since the lines will be stored in memory. It is also not possible for me to sort these lines, as they are not uniform.
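In Python terms, the per-pair check I am after is roughly this (just an illustration using the example lines above):

small_line = '@stackoVerfloW'
large_line = '{"data": "SN@StackOverflow"}'
print(small_line.lower() in large_line.lower())  # True, so large_line should be written out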
results = open("results.txt", "w")

small = []
large = []

for line in open('small.txt'):
    small.append(line)

for line in open('large.txt'):
    large.append(line)

for j in large:
    for i in small:
        if i in j:
            results.write(j + '\n')
Please let me know if there is anything I can do to clarify my issue and sorry this is my first post and I hope to write better questions in the future.
The problem there, though, is that the naive approach - as you put it - of trying to match all 90,000 lines against each of the 1,200,000 lines of the other file, performing a "contains" (Python's in operator) test on each large-file line along the way, leads to a prohibitive processing cost. You are basically talking about M x N x O operations, M being the large file size, N being the small file size, and O the average length of lines in the large file (minus the average length of lines in the small file) - that is on the order of a trillion operations to start with. With computers operating in the GHz range, it could be feasible in a few hours, provided the small file fits in memory.
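As an aside, the most likely reason your original loop prints nothing is that the lines read from small.txt keep their trailing newline and are never lowercased, so the substring test almost never succeeds. A minimal corrected sketch of that brute-force approach (slow, but it shows the fix, using the same file names as in your question) could look like this:

with open('small.txt') as f:
    # strip the newline and lowercase each needle once, up front
    small = [line.strip().lower() for line in f if line.strip()]

with open('large.txt') as large, open('results.txt', 'w') as results:
    for line in large:
        lowered = line.lower()
        # M x N substring tests - expect this to run for a long while
        if any(needle in lowered for needle in small):
            results.write(line)  # the line already ends with '\n'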
A smarter alternative uses location-independent hashes for each line in the large file. Location-independent hashing can be complicated - but hashing every possible contiguous word subset of the wider line lets us check each one against a dictionary containing all 90,000 lines of the smaller file in constant time, O(1), per lookup. Doing that once for each of the 1,200,000 lines takes roughly linear time, reducing this search from hours or days to a few seconds, just using string normalization and Python dictionaries.
In the end, this should be all the needed code:
import re

def normalize(text):
    text = text.strip().lower()
    # strip punctuation, replacing it with spaces so words stay separated
    text = re.sub(r'[^\w\s]', ' ', text)
    words = tuple(text.split())
    return words

def normalized_subsets(text):
    # all contiguous runs of words in the line, as tuples
    words = normalize(text)
    combinations = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            combinations.add(words[i: j])
    return combinations

def main():
    small = {normalize(line): line_num
             for line_num, line in enumerate(open("small.txt")) if line.strip()}
    with open("large.txt") as large, open("results.txt", "w") as results:
        for line_num, line in enumerate(large):
            for combination in normalized_subsets(line):
                if combination in small:
                    # prefix the matched line with its line number in each file
                    results.write(f"{line_num} {small[combination]} {line}")
                    break  # one match is enough; avoid writing the same line twice

main()
So, while you still see nested for loops in this code, the inner one only loops through the possible word subsets of each line - for a line with 30 words, that is fewer than 500 combinations (30 x 31 / 2 = 465) - and each of those is a constant-time dictionary lookup against the 90,000-entry small-file dictionary, as opposed to 90,000 comparisons per line.
So, it is still a quadratic algorithm in the end, but it should be sharply faster - for the sample line in the question, after removing the punctuation, it will try a match for every element in
{('data',),
('data', 'sn'),
('data', 'sn', 'stackoverflow'),
('sn',),
('sn', 'stackoverflow'),
('stackoverflow',)}
which is just 6 lookups (down from a linear search through 90,000 lines).
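As a quick sanity check (assuming the functions above are pasted into the same session), you can verify this on the question's example:

subsets = normalized_subsets('{"data": "SN@StackOverflow"}')
print(len(subsets))                            # 6
print(normalize('@stackoVerfloW'))             # ('stackoverflow',)
print(normalize('@stackoVerfloW') in subsets)  # True -> this large line gets written out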
(For extra value, this code also records the line numbers from the large file and from the small file at the beginning of each matched line in the results file.)