I want to compare two text files, f1.txt and f2.txt, remove the words that appear in both files from f2.txt, and sort the remaining words of f2.txt in descending order of frequency.
My approach:
import re
import sys
from collections import Counter

with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    passage = f2.read()
    common = f1.read()

words = re.findall(r'\w+', passage)
common_words = re.findall(r'\w+', common)
passage_text = [w.lower() for w in words]
final = set(passage_text) - set(common_words)
word_count = Counter(final)
for word, count in word_count.items():
    print(word, ":", count)
I expect the output to be something like:
Foo: 12
Bar: 11
Baz: 3
Longword: 1
but I am getting a count of 1 for every word.
Your value final contains only unique words (one entry each), which is why the Counter reports 1 occurrence for everything. Instead, use that set to filter passage_text and pass the filtered list to Counter:
import re
from collections import Counter
passage = '''
Foo and Bar and Baz or Longword
Bar or Baz
Foo foo foo
'''
common = '''and or'''
words = re.findall(r'\w+', passage)
common_words = re.findall(r'\w+', common)
passage_text = [w.lower() for w in words]
final_set = set(passage_text) - set(common_words)
word_count = Counter([w for w in passage_text if w in final_set])
for word, count in sorted(word_count.items(), key=lambda k: -k[1]): # or word_count.most_common()
print(word, ":", count)
Prints:
foo : 4
bar : 2
baz : 2
longword : 1
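Applied back to your original file-reading script, a minimal sketch could look like this (the filtering logic is factored into a helper function, a name I chose for illustration, so it can be tested without files):

```python
import re
import sys
from collections import Counter

def count_without_common(passage, common):
    """Return a Counter of lowercased words in `passage`,
    excluding every word that occurs in `common`."""
    words = [w.lower() for w in re.findall(r'\w+', passage)]
    stop = {w.lower() for w in re.findall(r'\w+', common)}
    return Counter(w for w in words if w not in stop)

if __name__ == '__main__':
    # usage: python script.py f1.txt f2.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        counts = count_without_common(f2.read(), f1.read())
    # most_common() already yields (word, count) pairs in
    # descending order of frequency
    for word, count in counts.most_common():
        print(word, ":", count)
```

Note that lowercasing the common words as well makes the removal case-insensitive on both sides, which your original code did only for the passage.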