Search code examples
pythonpython-3.xcsvcompare

How to compare 2 different csv files and output the differences


I have 2 CSVs which are New.csv and Old.csv that are have around 1K rows and 10 columns that has a structure like this:

enter image description here

If there is a longName (first column) in in the new.csv that is not in the old.csv, I would like that entire new.csv row to be appended to the changes.csv.

I started off by doing this but it does not work well at all:

def deltaFileMaker():
    with open('Old.csv', 'r', encoding='utf-8') as t1, open('New.csv', 'r', encoding='utf-8') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()

    with open('changes.csv', 'w', encoding='utf-8') as outFile:
        for line in filetwo:
            if line not in fileone:
                outFile.write(line)



deltaFileMaker()

I also tried to use csv-diff but I could not find a way to convert its output to a csv file

Update

def deltaFileMaker():
    from csv_diff import load_csv, compare
    diff = compare(
        load_csv(open("old.csv",encoding="utf8"), key="longName"),
        load_csv(open("new.csv",encoding="utf8"), key="longName")
    )

        with open('changes.csv', 'w',encoding="utf8") as f:  
        w = csv.DictWriter(f, diff.keys())
        w.writeheader()
        w.writerow(diff)


deltaFileMaker()


Doing this: enter image description here


Solution

  • Have you looked at csv-diff? Their website has an example that might be suitable:

    from csv_diff import load_csv, compare
    diff = compare(
        load_csv(open("one.csv"), key="id"),
        load_csv(open("two.csv"), key="id")
    )
    

    This should return a dict object, which you can parse into a CSV file. To parse that dict into rows, this is an example. Note: getting the changes to write correctly is difficult, but this is more of a proof-of-concept - modify as you wish

    from csv_diff import load_csv, compare
    from csv import DictWriter
    
    # Get all the row headers across all the changes
    headers = set({'change type'})
    for key, vals in diff.items():
        for val in vals: # Multiple of the same difference 'type'
            headers = headers.union(set(val.keys()))
    
    # Write changes to file
    with open('changes.csv', 'w', encoding='utf-8') as fh:
        w = DictWriter(fh, headers)
        w.writeheader()
        for key, changes in diff.items():
            for val in changes: # Add each instance of this type of change
                val.update({'change type': key}) # Add 'change type' data
                w.writerow(val)
    

    For the file one.csv:

    id,     name, age
     1,     Cleo,   4
     2, Pancakes,   2
    

    and two.csv:

    id,   name, age
     1,   Cleo,   5
     3, Bailey,   1
    4,  Elliot,  10
    

    Running this produces:

    change type,     name, id,               changes, age, key
          added,   Bailey,  3,                      ,   1,
          added,   Elliot,  4,                      ,  10,
        removed, Pancakes,  2,                      ,   2,
        changed,         ,   , "{'age': ['4', '5']}",    ,   1
    

    So not great for all changes, but works really well for the added/removed rows.