Compare two files and remove the words from the second file in Python


I'm trying to compare two files and get the difference using a function.

The first file (engwrds.txt) contains English words, one after the other, and the second file (ws.txt) contains web-scraped text. I want to compare the two files, remove the words from ws.txt, and write the result to a different file.

The web-scraped file contains both single words and full sentences, whereas the other file only lists words one after the other.

I tried the following code but it creates a blank output file.

with open('ws.txt', 'r', encoding='utf-8') as file1:
    with open('engwrds.txt', 'r', encoding='utf-8') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('output_file.txt', 'w', encoding='utf-8') as file_out:
    for line in same:
        file_out.write(line)

Then I tried this one, which doesn't print any output at all.

import fileinput
from pathlib import Path

with open('engwrds.txt', 'r', encoding='utf-8') as fin:
    exclude = set(line.rstrip() for line in fin)

with fileinput.input('ws.txt', inplace=True) as f:
    for line in f:
        if not exclude.intersection(Path(line.rstrip()).parts):
            print(line, end='')

The following code also doesn't print any output.

with open('op11-Copy1.txt', 'r') as file1:
    with open('commonwords.txt', 'r') as file2:
        dif = set(file1).difference(file2)
        
dif.discard('\n')
        
with open('diff.txt', 'w') as file_out:
    for line in dif:
        file_out.write(line)

Can you please explain the mistakes I'm making here? I have looked at multiple examples, but I can't figure out the issue. Ideally, I want to come up with a function that achieves this task.

This is what the ws.txt file looks like.
[image not included]

This is what the engwrds.txt looks like.
[image not included]

The output file looks like this.
[image not included]


Solution

  • Read each file into its own variable and then compare them. For example:

    Suppose that the file ws.txt (scraped file) contains:

    your world is beautiful

    And the file engwrds.txt contains these words (one after the other):

    while world want wild

    Open each one in a different variable:

    with open('engwrds.txt', 'r', encoding='utf-8') as file:
        engwrds = file.read()
    
    with open('ws.txt', 'r', encoding='utf-8') as file:
        ws = file.read()
    

    From here engwrds and ws are strings, so you can compare them in many different ways:

    differences = set(engwrds.split()).symmetric_difference(set(ws.split()))
    print(differences)
    
    Output: {'beautiful', 'is', 'want', 'while', 'wild', 'your'}
    

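    Note that split() separates on any whitespace (spaces, tabs, and newlines) and leaves punctuation attached to the words, and that symmetric_difference returns the words found in only one of the two files. Use difference instead if you only want the words of ws.txt that are not in engwrds.txt. From here you should have a better idea of how to solve the problem.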
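
Since the question asks for a function, here is a minimal sketch of one. It assumes the goal is to keep the words of ws.txt that do not appear in engwrds.txt, in their original order, and to write them to a separate file. The names remove_words and remove_words_from_file, the case-insensitive matching, and the regular expression used to define a "word" are all assumptions made for illustration, not part of the original answer.

```python
import re
from pathlib import Path


def remove_words(text, words_to_remove):
    """Return the words of `text` that are not in `words_to_remove`, in order."""
    # Assumption: matching is case-insensitive.
    exclude = {w.lower() for w in words_to_remove}
    # Assumption: a "word" is a run of ASCII letters or apostrophes,
    # so surrounding punctuation is dropped.
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if t.lower() not in exclude]


def remove_words_from_file(text_path, words_path, out_path):
    """Write the words of text_path that are not listed in words_path
    to out_path, one word per line, and return them as a list."""
    # split() handles word lists separated by spaces or by newlines.
    words = Path(words_path).read_text(encoding="utf-8").split()
    text = Path(text_path).read_text(encoding="utf-8")
    kept = remove_words(text, words)
    Path(out_path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return kept


# Same sample data as in the answer above.
print(remove_words("your world is beautiful", ["while", "world", "want", "wild"]))
# → ['your', 'is', 'beautiful']
```

With the file names from the question, the call would be remove_words_from_file('ws.txt', 'engwrds.txt', 'output_file.txt'). Iterating over an open file yields whole lines, including the trailing newline, so set(file1) in the question holds whole lines rather than individual words. Splitting the text into words first avoids that.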