Search code examples
pythonpython-3.xpandascsvfile-manipulation

Python Pandas: Compare two CSV files and delete lines from both the file by matching a column


We need to delete lines from both of the files if the value of the first column is not present in the other file.

Lets consider two CSV files:

file1.csv:
yrdi_391    111    1.11    1.0    1.1    111.0
yfyrn_9132  222    2.22    2.0    2.2    222.0
kdkfke_392  999    9.99    9.0    9.9    999.0
hfeisk_3    333    3.33    3.0    3.3    333.0

file2.csv:
yrdi_391    444    4.44    4.0    4.4    444.0
yfyrn_9132  555    5.55    5.0    5.5    555.0
hfeisk_3    666    6.66    6.0    6.6    666.0
fhedn_271   888    8.88    8.0    8.8    888.0

Now, we need to delete entire line starting with kdkfke_392 from the file1.csv as it's not present therein file2.csv.

On the other hand, we need to delete the entire line starting with fhedn_271, as it's not present in file1.csv.

Expected result:

file1.csv:
yrdi_391    111    1.11    1.0    1.1    111.0
yfyrn_9132  222    2.22    2.0    2.2    222.0
hfeisk_3    333    3.33    3.0    3.3    333.0

file2.csv:
yrdi_391    444    4.44    4.0    4.4    444.0
yfyrn_9132  555    5.55    5.0    5.5    555.0
hfeisk_3    666    6.66    6.0    6.6    666.0

As of now, the lines in file1.csv and file2.csv are not sorted. If required, we may do the sorting first and then apply the deletion.

Pandas CVS related manipulations are preferred as in both of the files we have headers and need to keep them.

Newbie in python scripting!

Any help will be highly appreciated!


Solution

  • You can use isin().

    print (df)
    
                0    1     2    3    4      5
    0    yrdi_391  111  1.11  1.0  1.1  111.0
    1  yfyrn_9132  222  2.22  2.0  2.2  222.0
    2  kdkfke_392  999  9.99  9.0  9.9  999.0
    3    hfeisk_3  333  3.33  3.0  3.3  333.0
    
    print (df1)
    
                0    1     2    3    4      5
    0    yrdi_391  444  4.44  4.0  4.4  444.0
    1  yfyrn_9132  555  5.55  5.0  5.5  555.0
    2    hfeisk_3  666  6.66  6.0  6.6  666.0
    3   fhedn_271  888  8.88  8.0  8.8  888.0
    

    csv_df = df[df[0].isin(df1[0])]
    
    print (csv_df)
                0    1     2    3    4      5
    0    yrdi_391  111  1.11  1.0  1.1  111.0
    1  yfyrn_9132  222  2.22  2.0  2.2  222.0
    3    hfeisk_3  333  3.33  3.0  3.3  333.0
    
    csv_df1 = df1[df1[0].isin(df[0])]
    
    print (csv_df1)
                0    1     2    3    4      5
    0    yrdi_391  444  4.44  4.0  4.4  444.0
    1  yfyrn_9132  555  5.55  5.0  5.5  555.0
    2    hfeisk_3  666  6.66  6.0  6.6  666.0
    
    csv_df.to_csv('temp.csv', index=False)
    csv_df1.to_csv('temp1.csv', index=False)