Search code examples
pythonpandascsvdifference

How to compare a specific column in two csv files and output differences to a third file


I have two csv files named test1.csv and test2.csv and they both have a column named 'Name'. I would like to compare each row in this Name column between both files and output the ones that don't match to a third file. I have seen some examples using pandas, but none worked for my situation. Can anyone help me get a script going for this?

Test2 will be updated to include all values from test1 plus new values not included in test1 (which are the ones i want saved to a third file)

An example of what the columns look like is:

test1.csv:

Name     Number    Status
gfd454   456       Disposed
3v4fd    521       Disposed
th678iy  678       Disposed

test2.csv

Name     Number    Status
gfd454   456       Disposed
3v4fd    521       Disposed
th678iy  678       Disposed
vb556h   665       Disposed    

Solution

  • See below.

    The idea is to read the names into s python set data structure and find the new names by doing set substruction.

    1.csv:

    Name Number
    A 12
    B 34 
    C 45 
    

    2.csv

    Name Number
    A 12
    B 34
    C 45
    D 77
    Z 67
    

    The code below will print {'D', 'Z'} which are the new names.

    def read_file_to_set(file_name):
        with open(file_name) as f:
            return set(l.strip().split()[0] for x,l in enumerate(f.readlines()) if x > 0)
    
    
    
    set_1 = read_file_to_set('1.csv')
    set_2 = read_file_to_set('2.csv')
    new_names = set_2 - set_1
    print(new_names)