Search code examples
pythonpandasarray-column

How to drop pandas row based on values of array column?


I have the following dataframe in pandas:

     id              name categoryids    shops
5   239         Boulanger         [5]      152
3   196  Bouygues Telecom         [5]      500
4   122             Darty       [5,3]      363
1   311     Electro Dépôt         [5]       81
0  2336            Orange        [15]      578
2   194            Orange         [5]      577

I would like to drop the 5th row because it's duplicated in name but has a different value in the column categoryids, but as the values are arrays (as they can have more than one value), I have problem comparing them.

My idea was to check the mode of this column and discard all rows that don't have this value in its array (for example, in this case, the mode would be 5, so the 5th column should be discarded as this value is not present in its array), but I have problems calculating this value as the column is an array, not a single value.

Any ideas or suggestions on how to do this?

I'm using python 3.7 and last version of pandas.

Thank you.


Solution

  • With a DataFrame like this:

    df = pd.DataFrame({'id': [239,196,122,311,2336,194,],
    'name': ['Boulanger','Bouygues Telecom','Darty','Electro Dépôt','Orange','Orange',],
    'shops': [152, 500, 363, 81, 578, 577,],
    'categoryids': [[5],[5],[5,3],[5],[15],[5],]})
    

    You can do:

    df.sort_values('categoryids').drop_duplicates('name', keep='first')
    

    Which sorts the categoryids column, then drops the duplicates in name and keeps the first of it.

    EDIT:

    Another thing you can do is to check whether the value you are after in the categoryids column exist:

    df["exist"] = [int(5 in r)  for r in df["categoryids"]]
    

    Which will give you:

        id              name                shops   categoryids exist
    0   239             Boulanger             152            [5]    1
    1   196             Bouygues Telecom      500            [5]    1
    2   122             Darty                 363         [5, 3]    1
    3   311             Electro Dépôt          81            [5]    1
    4   2336            Orange                578           [15]    0
    5   194             Orange                577            [5]    1
    

    And then you can take only the ones that exist:

    df[df['exist'] == 1]
    

    Then find the duplicates using pd.duplicated() as @Erfan mentioned:

    df['dups'] = df['name'].duplicated(keep=False).astype(int)
    
    
        id  name               shops     categoryids    exist   dups
    0   239 Boulanger            152             [5]    1          0
    1   196 Bouygues Telecom     500             [5]    1          0
    2   122 Darty                363          [5, 3]    1          0
    3   311 Electro Dépôt         81             [5]    1          0
    4   2336    Orange           578            [15]    0          1
    5   194 Orange               577             [5]    1          1
    
    
    df[(
        (df['dups']!=1) | 
        (df['exist']!=0)
    )].drop(['exist', 'dups'], axis=1).reset_index()
    

    Would result in:

    index   id  name               shops    categoryids
    0   0   239 Boulanger            152    [5]
    1   1   196 Bouygues Telecom     500    [5]
    2   2   122 Darty                363    [5, 3]
    3   3   311 Electro Dépôt         81    [5]
    4   5   194 Orange               577    [5]