I have 2 csv files. I need to delete all rows of the first file in which the first column have a string found in the first column of the second file. The head of table 1 is:
Genus | FAGR | MOCA | MUBR | MUHA |
---|---|---|---|---|
1-14-0-20-45-16 | 0 | 0 | 40 | 0 |
1-14-0-20-46-22 | 0 | 0 | 0 | 169 |
2-02-FULL-61-13 | 0 | 0 | 0 | 27 |
2-12-FULL-35-15 | 56 | 182 | 435 | 311 |
The head of table 2 is:
Genus | FAGR | MOCA | MUBR |
---|---|---|---|
1-14-0-20-46-22 | 0 | 0 | 0 |
2-02-FULL-61-13 | 0 | 0 | 0 |
21-14-0-10-47-8-A | 0 | 0 | 0 |
AAA536-G1 | 0 | 0 | 0 |
The expected output file contains the rows of file 1 except the rows that match the first 2 rows of the second file (which have in common the following strings in the first column: 1-14-0-20-46-22 and 2-02-FULL-61-13). When the complete files are compared, the entire file 2 has to be removed from file 1.
I am going through pandas indexing and selecting data but still cannot find a solution, probably becase I am a newbie.
I tried the solution posted and it came out like this:
df1 = generagrouped_df
df2['drop_key'] = 'DROP'
output = pd.merge(
left = df1,
right = df2,
how = 'left',
left_on = ['Genus'],
right_on = ['Genus']
)
output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)
The error message was KeyError: 'drop_key' (below):
KeyError Traceback (most recent call last)
<ipython-input-103-67d27afa824b> in <module>()
----> 1 output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)
/Users/AnaPaula/opt/anaconda2/lib/python2.7/site-packages. /pandas/core/frame.pyc in __getitem__(self, key)
2925 if self.columns.nlevels > 1:
2926 return self._getitem_multilevel(key)
-> 2927 indexer = self.columns.get_loc(key)
2928 if is_integer(indexer):
2929 indexer = [indexer]
/Users/AnaPaula/opt/anaconda2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_loc(self, key, method, tolerance)
2657 return self._engine.get_loc(key)
2658 except KeyError:
-> 2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2661 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'drop_key'
Can you figure out the solution? Thanks AP
I found a solution. Since the entire file 2 had to be removed from file 1, I did the following command, which informed just the first column to be compared, and it worked:
df1.loc[pd.merge(df1, df2, on=['Genus'], how='left', indicator=True)['_merge'] == 'left_only']
Thanks for you time! AP