
Compare, remove, and count words in text files


I want to compare two text files, f1.txt and f2.txt, remove the words that appear in both files from f2.txt, and sort the new f2.txt in descending order by frequency.

My approach:

  1. Make a list of words from both f1.txt and f2.txt.
  2. Remove unwanted characters from the text input.
  3. Compare the two lists and remove the common words from the list generated from f2.txt.
  4. Sort the words in the list generated from f2.txt by frequency.

import sys
import re
from collections import Counter

with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    passage = f2.read()
    common = f1.read()

words = re.findall(r'\w+', passage)
common_words = re.findall(r'\w+', common)
passage_text = [words.lower() for words in words]
final = set(passage_text) - set(common_words)
word_count = Counter(final)
for word, count in word_count.items():
    print(word, ":", count)

I expect the output to be something like:

Foo:          12
Bar:          11
Baz:           3
Longword:      1

but I am getting a count of 1 for every word.


Solution

  • Your value final contains only unique words (one of each), which is why the Counter shows only 1 occurrence per word. You need to filter passage_text with this set of words and pass that filtered list to Counter:

    import re
    from collections import Counter
    
    passage = '''
        Foo and Bar and Baz or Longword
        Bar or Baz
        Foo foo foo
    '''
    
    common = '''and or'''
    
    words = re.findall(r'\w+', passage)
    common_words = re.findall(r'\w+', common)
    passage_text = [word.lower() for word in words]
    final_set = set(passage_text) - set(common_words)
    # Count every occurrence of the surviving words, not just the unique set
    word_count = Counter([w for w in passage_text if w in final_set])
    for word, count in sorted(word_count.items(), key=lambda k: -k[1]): # or word_count.most_common()
        print(word, ":", count)
    

    Prints:

    foo : 4
    bar : 2
    baz : 2
    longword : 1
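
    Applied back to the original file-based script, the same filtering might look like the sketch below. It assumes, as in the question, that sys.argv[1] is f1.txt (the common words) and sys.argv[2] is f2.txt (the passage); Counter.most_common() already returns (word, count) pairs in descending order of frequency, so no explicit sorted() call is needed:

    import re
    import sys
    from collections import Counter

    # Assumed invocation: python script.py f1.txt f2.txt
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        common = f1.read()
        passage = f2.read()

    words = [w.lower() for w in re.findall(r'\w+', passage)]
    common_words = {w.lower() for w in re.findall(r'\w+', common)}

    # Keep every occurrence of the non-common words, then count them
    word_count = Counter(w for w in words if w not in common_words)

    # most_common() yields the pairs sorted by descending frequency
    for word, count in word_count.most_common():
        print(word, ":", count)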