Search code examples
pythonpandasmulti-index

Get first instance multi-index and retaining index values


I have a df with a multi-index. Both IDX1 and IDX2 have duplicate values ('C' and 'Y' respectively), and I want to retain the first instance of both and keep the index value; df_desired is what I want the result to look like.

I've tried df_tst = df.groupby(level=[0,2]).first() but this drops IDX2 from the df so I have no way of removing the duplicate from IDX1.

# starting data
import pandas as pd
data = \
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65],
['31/01/2021','C','Q',40,45],
['28/02/2021','C','Q',50,55],
['31/03/2021','C','Q',60,65],
['31/01/2021','D','Y',20,15],
['28/02/2021','D','Y',20,15],
['31/03/2021','D','Y',30,30]]

df=pd.DataFrame(data)
df.columns = ['DATE','IDX1','IDX2','VAR1','VAR2']
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index(['IDX1','IDX2','DATE'], inplace=True)

# target df:
df_desired = \
[['31/01/2021','A','X',10,15],
['28/02/2021','A','X',20,30],
['31/03/2021','A','X',30,45],
['31/01/2021','B','Y',20,15],
['28/02/2021','B','Y',20,15],
['31/03/2021','B','Y',30,30],
['31/01/2021','C','Z',40,45],
['28/02/2021','C','Z',50,55],
['31/03/2021','C','Z',60,65]]

Solution

  • Use:

    #get first indices per first and second level with remove last level of MultiIndex
    df1 = df.droplevel(-1)
    idx1 = df1.groupby(level=0).head(1).index
    idx2 = df1.groupby(level=1).head(1).index
    
    #compare with intersecton of both levels
    df = df[df1.index.isin(idx1.intersection(idx2))]
    print (df)
                          VAR1  VAR2
    IDX1 IDX2 DATE                  
    A    X    2021-01-31    10    15
              2021-02-28    20    30
              2021-03-31    30    45
    B    Y    2021-01-31    20    15
              2021-02-28    20    15
              2021-03-31    30    30
    C    Z    2021-01-31    40    45
              2021-02-28    50    55
              2021-03-31    60    65
    

    Or:

    #test duplicated values by first and second level with helper df1 DataFrame
    df1 = df.index.to_frame()
    df2 = df.droplevel(-1)
    idx = df2.index[~df1.duplicated(['IDX1']) & ~df1.duplicated(['IDX2'])]
    
    df = df[df2.index.isin(idx)]
    print (df)
                          VAR1  VAR2
    IDX1 IDX2 DATE                  
    A    X    2021-01-31    10    15
              2021-02-28    20    30
              2021-03-31    30    45
    B    Y    2021-01-31    20    15
              2021-02-28    20    15
              2021-03-31    30    30
    C    Z    2021-01-31    40    45
              2021-02-28    50    55
              2021-03-31    60    65