I have the following dataframe in pandas:
id name categoryids shops
5 239 Boulanger [5] 152
3 196 Bouygues Telecom [5] 500
4 122 Darty [5,3] 363
1 311 Electro Dépôt [5] 81
0 2336 Orange [15] 578
2 194 Orange [5] 577
I would like to drop the 5th row because it's duplicated in name but has a different value in the column categoryids, but as the values are arrays (as they can have more than one value), I have problem comparing them.
My idea was to check the mode of this column and discard all rows that don't have this value in its array (for example, in this case, the mode would be 5, so the 5th column should be discarded as this value is not present in its array), but I have problems calculating this value as the column is an array, not a single value.
Any ideas or suggestions on how to do this?
I'm using python 3.7 and last version of pandas.
Thank you.
With a DataFrame like this:
df = pd.DataFrame({'id': [239,196,122,311,2336,194,],
'name': ['Boulanger','Bouygues Telecom','Darty','Electro Dépôt','Orange','Orange',],
'shops': [152, 500, 363, 81, 578, 577,],
'categoryids': [[5],[5],[5,3],[5],[15],[5],]})
You can do:
df.sort_values('categoryids').drop_duplicates('name', keep='first')
Which sorts the categoryids
column, then drops the duplicates in name
and keeps the first of it.
EDIT:
Another thing you can do is to check whether the value you are after in the categoryids
column exist:
df["exist"] = [int(5 in r) for r in df["categoryids"]]
Which will give you:
id name shops categoryids exist
0 239 Boulanger 152 [5] 1
1 196 Bouygues Telecom 500 [5] 1
2 122 Darty 363 [5, 3] 1
3 311 Electro Dépôt 81 [5] 1
4 2336 Orange 578 [15] 0
5 194 Orange 577 [5] 1
And then you can take only the ones that exist:
df[df['exist'] == 1]
Then find the duplicates using pd.duplicated()
as @Erfan mentioned:
df['dups'] = df['name'].duplicated(keep=False).astype(int)
id name shops categoryids exist dups
0 239 Boulanger 152 [5] 1 0
1 196 Bouygues Telecom 500 [5] 1 0
2 122 Darty 363 [5, 3] 1 0
3 311 Electro Dépôt 81 [5] 1 0
4 2336 Orange 578 [15] 0 1
5 194 Orange 577 [5] 1 1
df[(
(df['dups']!=1) |
(df['exist']!=0)
)].drop(['exist', 'dups'], axis=1).reset_index()
Would result in:
index id name shops categoryids
0 0 239 Boulanger 152 [5]
1 1 196 Bouygues Telecom 500 [5]
2 2 122 Darty 363 [5, 3]
3 3 311 Electro Dépôt 81 [5]
4 5 194 Orange 577 [5]