I have two CSV files with a common column named 'Name'. File 2 is continuously updated, with new values added at random positions in the column. How can I write a script to compare the two columns and find the differences, regardless of where the new values are placed in File 2?
Other solutions only find the differences when the new values are at the end of the column, not when they appear randomly within it.
Code I have tried (it only picks up new values at the bottom of the column, not ones inserted randomly within it):
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
new_df = (df1[['Name']].merge(df2[['Name']], on='Name', how='outer', indicator=True)
          .query("_merge != 'both'")
          .drop('_merge', axis=1))
new_df.to_csv('file4.csv', index=False)
File1:
Name
gfd454
3v4fd
th678iy
File2:
Name
gfd454
fght45
3v4fd
th678iy
The output should be:
Name
fght45
import pandas as pd

# df1: original DataFrame of File 1 data
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
# df2: DataFrame of the changing File 2 data
df2 = pd.DataFrame({'Name': ['gfd454', 'abcde', 'fght45', '3v4fd', 'abcde', 'th678iy', 'abcde']})

# Assuming df1 has distinct elements and doesn't change, and that
# df2 contains all elements of df1 plus the new updates.
# df2 may contain duplicates such as 'abcde'.

# Drop duplicates in df2 (if df1 can have duplicates, drop those first too).
# ``keep='first'`` keeps the first occurrence and drops the rest.
df2.drop_duplicates(keep='first', inplace=True)
print(df2)

# pd.concat appends the elements of df2 to df1, even those already present in df1
df_concat = pd.concat([df1, df2], join='outer', ignore_index=True)
print(df_concat)

# Now drop the values that appear in both df1 and df2;
# ``keep=False`` removes every occurrence of a duplicated value.
df_diff = df_concat.drop_duplicates(keep=False)
print(df_diff)
Now, the caveat is that you have to ensure that df1 - df2 = {}, i.e. that df1 is a subset of df2; otherwise, values present in df1 but missing from df2 also survive the final drop_duplicates and incorrectly show up in df_diff.
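If you cannot guarantee that df1 is a subset of df2, a filter with `isin` avoids the issue entirely: it keeps only the rows of df2 whose 'Name' is absent from df1, wherever they occur. A minimal sketch, reusing the same sample data as above:

```python
import pandas as pd

# Same sample frames as in the example above
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', 'abcde', 'fght45', '3v4fd', 'abcde', 'th678iy', 'abcde']})

# Keep only the rows of df2 whose 'Name' does not appear in df1,
# regardless of position, then drop repeated new values.
df_new = df2[~df2['Name'].isin(df1['Name'])].drop_duplicates()
print(df_new)
# 'Name' column contains: abcde, fght45
```

This is a one-directional difference (new values in df2 only), so rows that exist in df1 but not in df2 are simply ignored rather than leaking into the result.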