Search code examples
pythoncsvcomparison

How to compare csv files and output lines from both files where there are changes?


I'm very new to python and I'm trying to compare 2 csv files with mostly the same information but some lines are completely removed, new, or only have 1 value changed. I need an output file that has the complete line from both the previous and current csv files one on top the other only if there is a change. I need to also add a column at the very front and label the lines according to which file they came from (previous or current).

I have tried using HtmlDiff from difflib, but that doesn't give the information in the format I want and it also shows all the information that did not change. I also tried csv.reader and diff_rows but that was a disaster.

The closest I came to my result was the below, but in the combined file it outputs there is no way for me to know which line is from which file because it does not have labels. Try not to laugh too hard at my code; I'm sure there is a much better way of doing this but I can't figure it out on my own and I'd really appreciate your help.

If I don't define previous and current the second time then the removals output is empty.

previous = open('2019-08-21.csv', 'r', encoding="utf8")
current = open('2019-08-27.csv', 'r', encoding="utf8")

additions = set(current) - set(previous)

with open('Additions Aug 2019.csv', 'w', encoding="utf8") as file_out:
    for line in additions:
        file_out.write(line)

previous = open('2019-08-21.csv', 'r', encoding="utf8")
current = open('2019-08-27.csv', 'r', encoding="utf8")

removals = set(previous) - set(current)

with open('Removals Aug 2019.csv', 'w', encoding="utf8") as file_out:
    for line in removals:
        file_out.write(line)

filenames = ['Additions Aug 2019.csv', 'Removals Aug 2019.csv']
with open('Add, Rem Aug 2019.csv', 'w', encoding="utf8") as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

previous.close()
current.close()
file_out.close()

Solution

  • I managed to figure out a solution with pandas and will share for anyone else that may need.

    import pandas as pd
    
    previous = open('2019-08-21.csv', 'r', encoding="utf8")
    current = open('2019-08-27.csv', 'r', encoding="utf8")
    
    df_p = pd.read_csv(previous)
    df_p.drop(['Middle Name', 'Date of Birth', 'Place of Birth'], axis=1, inplace=True)
    df_p.insert(0, 'Change Type', "Removed")
    
    df_c = pd.read_csv(current)
    df_c.drop(['Middle Name', 'Date of Birth', 'Place of Birth'], axis=1, inplace=True)
    df_c.insert(0, 'Change Type', "Added")
    
    df_f = df_c.append(df_p)
    df_dedup = df_f.drop_duplicates(subset=['Full Name', 'Country', 'Position'], keep=False)
    
    with open('Aug 2019 Changes.csv', 'w', encoding='utf8') as file_out:
        df_dedup.to_csv(file_out, index=False)