Search code examples
pythonpandascsvconcatenationdifference

Returning differences between two columns in two different files in excel using python


I have two csv files with a common column named 'Name'. File 2 will continuously update and add new values randomly in the column. How can i write a script to compare the two columns and find the differences regardless of where the new values are placed in file2.

Other solutions will find the differences only if the new values are at the end of the column, not randomly within the column.

Code I have tried (only outputs the new values at the bottom of the column, not when it is randomly in the column):

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

new_df = (df1[['Name']].merge(df2[['Name']],on='Name',how = 'outer',indicator = True)
                       .query("_merge != 'both'")
                       .drop('_merge',axis = 1))

new_df.to_csv('file4.csv')

File1:

Name     
gfd454
3v4fd
th678iy

File2:

Name     
gfd454
fght45
3v4fd
th678iy

The output should be:

Name
fght45

Solution

  • # df1 original dataframe of File_1 data
    df1 = pd.DataFrame({'Name':[ 'gfd454' , '3v4fd', 'th678iy']})
    
    # df2 dataframe of changing File_2 data
    df2 = pd.DataFrame({'Name':[ 'gfd454' , 'abcde', 'fght45', '3v4fd', 'abcde' ,'th678iy', 'abcde']})
    
    # Assuming df1 comprises distinct elements and doesn't change, and that
    # df2 contains all elements of df1 and more (the new updates) 
    # df2 may have duplicates like 'abcde'
    
    # Drop duplicates in df2, if df1 has duplicates also drop it first
    # ``keep = first`` : Drop duplicates except for the first occurrence.
    df2.drop_duplicates(keep='first', inplace=True)
    print(df2)
    
    # pandas.concat adds elements of df2 to df1, even if it already exists in df1
    df_concat = pd.concat([df1,df2], join='outer', ignore_index = True)
    print(df_concat)
    

    df_concat

    # now drop the duplicates between df1, df2
    df_diff = df_concat .drop_duplicates(keep=False)
    print(df_diff)
    

    df_diff

    Now, the problem with this is that you have to ensure that df1-df2 = {}, i.e. df1 is subset of df2