We need to delete lines from both of the files if the value of the first column is not present in the other file.
Lets consider two CSV files:
file1.csv:
yrdi_391 111 1.11 1.0 1.1 111.0
yfyrn_9132 222 2.22 2.0 2.2 222.0
kdkfke_392 999 9.99 9.0 9.9 999.0
hfeisk_3 333 3.33 3.0 3.3 333.0
file2.csv:
yrdi_391 444 4.44 4.0 4.4 444.0
yfyrn_9132 555 5.55 5.0 5.5 555.0
hfeisk_3 666 6.66 6.0 6.6 666.0
fhedn_271 888 8.88 8.0 8.8 888.0
Now, we need to delete entire line starting with kdkfke_392
from the file1.csv
as it's not present therein file2.csv
.
On the other hand, we need to delete the entire line starting with fhedn_271
, as it's not present in file1.csv
.
Expected result:
file1.csv:
yrdi_391 111 1.11 1.0 1.1 111.0
yfyrn_9132 222 2.22 2.0 2.2 222.0
hfeisk_3 333 3.33 3.0 3.3 333.0
file2.csv:
yrdi_391 444 4.44 4.0 4.4 444.0
yfyrn_9132 555 5.55 5.0 5.5 555.0
hfeisk_3 666 6.66 6.0 6.6 666.0
As of now, the lines in file1.csv
and file2.csv
are not sorted.
If required, we may do the sorting first and then apply the deletion.
Pandas CVS related manipulations are preferred as in both of the files we have headers and need to keep them.
Newbie in python scripting!
Any help will be highly appreciated!
You can use isin()
.
print (df)
0 1 2 3 4 5
0 yrdi_391 111 1.11 1.0 1.1 111.0
1 yfyrn_9132 222 2.22 2.0 2.2 222.0
2 kdkfke_392 999 9.99 9.0 9.9 999.0
3 hfeisk_3 333 3.33 3.0 3.3 333.0
print (df1)
0 1 2 3 4 5
0 yrdi_391 444 4.44 4.0 4.4 444.0
1 yfyrn_9132 555 5.55 5.0 5.5 555.0
2 hfeisk_3 666 6.66 6.0 6.6 666.0
3 fhedn_271 888 8.88 8.0 8.8 888.0
csv_df = df[df[0].isin(df1[0])]
print (csv_df)
0 1 2 3 4 5
0 yrdi_391 111 1.11 1.0 1.1 111.0
1 yfyrn_9132 222 2.22 2.0 2.2 222.0
3 hfeisk_3 333 3.33 3.0 3.3 333.0
csv_df1 = df1[df1[0].isin(df[0])]
print (csv_df1)
0 1 2 3 4 5
0 yrdi_391 444 4.44 4.0 4.4 444.0
1 yfyrn_9132 555 5.55 5.0 5.5 555.0
2 hfeisk_3 666 6.66 6.0 6.6 666.0
csv_df.to_csv('temp.csv', index=False)
csv_df1.to_csv('temp1.csv', index=False)