Search code examples
pythonpandasmulti-index

Delete rows with conditions (Multi-Index case)


I'm new to Stack Overflow and I have this Data Set :

df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
         'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
         'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
         'Date': {0: '2018-08-02',
          1: '2018-12-02',
          2: '2019-03-02',
          3: '2017-03-02',
          4: '2018-03-02',
          5: '2020-03-04',
          6: '2020-02-03'},
         'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})

The variables 'ID' and 'Step' are Multi-index.

I would like to do two things :

FIRST :

If 'Num' is different for the same 'ID', then delete the rows of this ID.

SECONDLY :

For a same ID, the step 'F' should be the last one (with the most recent date). If not, then delete the rows of this ID.

I have some difficulties because the commands df['Step'] and df['ID'] are NOT WORKING ('ID' and 'Step' are Multi-Index cause of a recent groupby() ).

I've tried groupby(level=0) that I found on Multi index dataframe delete row with maximum value per group

But I still have some difficulties.

Could someone please help me?

Expected Output :

df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4},
         'Step': {0: 'A', 1: 'Bar', 2: 'F'},
         'Num': {0: 38, 1: 38, 2: 38},
         'Date': {0: '2018-08-02',
          1: '2018-12-02',
          2: '2019-03-02',
         'Occurence': {0: 3, 1: 3, 2: 3}})

The ID 88 has been removed because the step 'F' was not the last one step (with the most recent date). The ID 323 has been removed because Num 433!=Num 432.


Solution

  • Since you stated that ID and Step are in the index, we can do it this way:

    df1[df1.sort_values('Date').groupby('ID')['Num']\
           .transform(lambda x: (x.nunique() == 1) & 
                                (x.index.get_level_values(1)[-1] == 'F'))]
    

    Output:

             Num        Date  Occurence
    ID Step                            
    4  A      38  2018-08-02          3
       Bar    38  2018-12-02          3
       F      38  2019-03-02          3
    

    How?

    • First sort the dataframe by 'Date'
    • Then group the dataframe by ID
    • Taking each group of the dataframe and using the 'Num' column to transform in a boolean series, we first get the number of unique elements of 'Num' in that group, if that number is equal to 1, then you know that in that group all 'Num's are the same and that is True
    • Secondly, and we get the inner level of the MultiIndex (level=1) and we check the last value using indexing with [-1], if that value is equal to 'F' then have a True also