I have two big unstructured text files which cannot fit into memory. I want to find the words common to both.
What would be the most efficient way to do this, in both time and space?
Thanks
I tested it with these two files:
pi_poem
Now I will a rhyme construct
By chosen words the young instruct
I do not like green eggs and ham
I do not like them Sam I am
pi_prose
The thing I like best about pi is the magic it does with circles.
Even young kids can have fun with the simple integer approximations.
The code is simple. The first loop reads the first file line by line, adding each word to a lexicon set. The second loop reads the second file the same way; every word it finds in the first file's lexicon goes into a set of common words. Note that only the unique words are held in memory, which is typically far smaller than the files themselves, so this works even when the files don't fit in memory.
Does that do what you need? You'll need to adapt it for punctuation, and you'll probably want to remove the extra printing once you have it working.
lexicon = set()
with open("pi_poem", 'r') as text:
    for line in text:               # iterate the file object directly; readlines() would load the whole file
        for word in line.split():
            lexicon.add(word)       # sets ignore duplicates, so no membership check is needed
print(lexicon)

common = set()
with open("pi_prose", 'r') as text:
    for line in text:
        for word in line.split():
            if word in lexicon:     # keep only words already seen in the first file
                common.add(word)
print(common)
Output:
{'and', 'am', 'instruct', 'ham', 'chosen', 'young', 'construct', 'Now', 'By', 'do', 'them', 'I', 'eggs', 'rhyme', 'words', 'not', 'a', 'like', 'Sam', 'will', 'green', 'the'}
{'I', 'the', 'like', 'young'}
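For the punctuation adaptation mentioned above, here is a minimal sketch of one possible approach (my own assumption, not part of the answer's code): tokenize each line with a regex and lowercase the words so that "Pi" and "pi." match. The WORD pattern and the words() helper are hypothetical names chosen for illustration.

import re

WORD = re.compile(r"[A-Za-z']+")    # assumed tokenizer: runs of letters and apostrophes

def words(path):
    # Yield lowercased words from a file, streaming one line at a time.
    with open(path, 'r') as text:
        for line in text:
            for word in WORD.findall(line):
                yield word.lower()

lexicon = set(words("pi_poem"))
common = {w for w in words("pi_prose") if w in lexicon}
print(common)

Lowercasing changes the matching semantics (e.g. 'The' and 'the' collapse into one word), so drop the .lower() call if case matters for your data.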