Search code examples
pandasdataframemulti-index

Pandas: keep first row of duplicated indices of second level of multi index


I found lots of drop_duplicates for index when both multi level indices are the same but, I would like to keep the first row of a multi index when the second level of the multi index has duplicates. So here:

|                                |   col_0 |   col_1 |   col_2 |   col_3 |   col_4 |
|:-------------------------------|--------:|--------:|--------:|--------:|--------:|
|  date         |      ID        
| ('2022-01-01', 'identifier_0') |      26 |      46 |      44 |      21 |      10 |
| ('2022-01-01', 'identifier_1') |      25 |      45 |      83 |      23 |      45 |
| ('2022-01-01', 'identifier_2') |      42 |      79 |      55 |       5 |      78 |
| ('2022-01-01', 'identifier_3') |      32 |       4 |      57 |      19 |      61 |
| ('2022-01-01', 'identifier_4') |      30 |      25 |       5 |      93 |      72 |
| ('2022-01-02', 'identifier_0') |      42 |      14 |      56 |      43 |      42 |
| ('2022-01-02', 'identifier_1') |      90 |      27 |      46 |      58 |       5 |
| ('2022-01-02', 'identifier_2') |      33 |      39 |      53 |      94 |      86 |
| ('2022-01-02', 'identifier_3') |      32 |      65 |      98 |      81 |      64 |
| ('2022-01-02', 'identifier_4') |      48 |      31 |      25 |      58 |      15 |
| ('2022-01-03', 'identifier_0') |       5 |      80 |      33 |      96 |      80 |
| ('2022-01-03', 'identifier_1') |      15 |      86 |      45 |      39 |      62 |
| ('2022-01-03', 'identifier_2') |      98 |       3 |      42 |      50 |      83 |

I'd like to keep first rows with unique ID.


Solution

  • If your index is a MultiIndex:

    >>> df.loc[~df.index.get_level_values('ID').duplicated()]
    
                             col_0  col_1  col_2  col_3  col_4
    date       ID                                             
    2022-01-01 identifier_0     26     46     44     21     10
               identifier_1     25     45     83     23     45
               identifier_2     42     79     55      5     78
               identifier_3     32      4     57     19     61
               identifier_4     30     25      5     93     72
    
    # Or
    >>> df.groupby(level='ID').first()
    
                  col_0  col_1  col_2  col_3  col_4
    ID                                             
    identifier_0     26     46     44     21     10
    identifier_1     25     45     83     23     45
    identifier_2     42     79     55      5     78
    identifier_3     32      4     57     19     61
    identifier_4     30     25      5     93     72
    

    If your index is an Index:

    >>> df.loc[~df.index.str[1].duplicated()]
    
                                col_0  col_1  col_2  col_3  col_4
    (2022-01-01, identifier_0)     26     46     44     21     10
    (2022-01-01, identifier_1)     25     45     83     23     45
    (2022-01-01, identifier_2)     42     79     55      5     78
    (2022-01-01, identifier_3)     32      4     57     19     61
    (2022-01-01, identifier_4)     30     25      5     93     72
    
    >>> df.groupby(df.index.str[1]).first()
                  col_0  col_1  col_2  col_3  col_4
    identifier_0     26     46     44     21     10
    identifier_1     25     45     83     23     45
    identifier_2     42     79     55      5     78
    identifier_3     32      4     57     19     61
    identifier_4     30     25      5     93     72