
Delete lines during file compare without deleting line numbers or injecting new blank lines


file1 has a big list of numbers and file2 has a small list; every number in file2 duplicates one in file1. I want to remove from file1 the numbers that also appear in file2, without deleting any data from file2 and without changing the line numbers in file1 (I use the PyCharm IDE, which displays the line numbers). The code below does remove the duplicate data from file1 and leaves file2 untouched, which is what I want. However, it deletes the whole line along with the number, so the remaining lines in file1 move up, which is what I don't want.

import fileinput

# small file2
with open('file2.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
# big file1
    for line in fileinput.input('file1.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line)

Example of what is happening when file2 contains 34344:

file-1 at start:
54545
34344
23232
78787

file-1 end:
54545
23232
78787

What I want:

file-1 start:
54545
34344
23232
78787

file-1 end:
54545

23232
78787


Solution

  • You just need to print an empty line whenever you find a value that is in the exclude set.

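    import fileinput

    # small file2: the values to blank out
    with open('file2.txt') as fin:
        exclude = set(line.rstrip() for line in fin)

    # big file1: with inplace=True, print() writes into file1.txt
    for line in fileinput.input('file1.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')  # line already ends with a newline
        else:
            print()  # empty line keeps the line numbers unchanged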

    If file1.txt is:

    54545
    1313
    23232
    13551

    And file2.txt is:

    1313
    13551

    After running the script, file1.txt becomes:

    54545

    23232

    Small note on efficiency

    As you said, this code does in fact rewrite every line, both the edited ones and the untouched ones. Deleting and rewriting only a few lines in the middle of a file is not easy, and I am not sure it would be more efficient in your case anyway: you do not know in advance which lines need editing, so you always have to read and process the whole file line by line to find them. As far as I know, you will hardly find a solution much more efficient than this one. I would be glad to be proven wrong if anybody knows how.
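
To make the single pass described in the efficiency note explicit, here is a minimal sketch of the same idea without fileinput. It is not the answer's code: the function name blank_matching_lines and the temporary-file handling are assumptions added for illustration. It reads file1 once, writes either the original line or a bare newline to a temporary file, and then swaps that file into place, so the line count is unchanged.

```python
import os
import tempfile


def blank_matching_lines(target_path, exclude_path):
    """Replace each line of target_path that appears in exclude_path with an empty line.

    Hypothetical helper for illustration; the name and structure are assumptions.
    """
    # Load the values to blank out; rstrip() drops the newline and trailing spaces.
    with open(exclude_path) as fin:
        exclude = {line.rstrip() for line in fin}

    # Write to a temporary file in the same directory, so that the final
    # os.replace() stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, 'w') as fout, open(target_path) as fin:
            for line in fin:
                # Keep the line as-is, or emit a bare newline so that every
                # later line keeps its line number.
                fout.write('\n' if line.rstrip() in exclude else line)
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        os.remove(tmp_path)
        raise
```

Calling blank_matching_lines('file1.txt', 'file2.txt') would leave file2.txt untouched. Like the fileinput version, it still reads and rewrites the whole of file1.txt once, which is the cost the note above describes.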