Search code examples
pythonpandasmulti-indexdrop-duplicates

Pandas multiindex duplicated only for particular indices


Say I have a Pandas dataframe with multiple indices:

arrays = [["UK", "UK", "US", "FR"], ["Firm1", "Firm1", "Firm2", "Firm1"], ["Andy", "Peter", "Peter", "Andy"]]
idx = pd.MultiIndex.from_arrays(arrays, names = ("Country", "Firm", "Responsible"))
df = pd.DataFrame(np.zeros(4), index = idx)
df

                             0
Country Firm  Responsible     
UK      Firm1 Andy         0.0
              Peter        0.0
US      Firm2 Peter        0.0
FR      Firm1 Andy         0.0

I want to drop duplicated entries of the first two index levels (In the example, rows with "UK" and "Firm1" entries) and keep only the rows, where the third index "Responsible" is equal to "Andy". So I want to drop the second row in this case.

In pandas there is drop_duplicates() but I don't see how I can i) apply it only on the first two index levels and ii) specify to keep rows with "Andy" and drop the rest (the function only allows for 'first' and 'last' as arguments)

I would be happy for inputs! Many thanks in advance.


Solution

  • You want to remove entries if the first 2 levels are duplicated and Andy does not appear in the Responsible level: First use Index.to_frame for DataFrame, and then test both first 2 levels for duplicates with DataFrame.duplicated, setting the keep=False argument for all duplicates. Finally, filter only the Andy rows by chaining another mask with the bitwise "OR" operator |.

    df1 = df.index.to_frame()
    
    df = df[~df1.duplicated(subset=['Country','Firm'], keep=False) | 
             df1['Responsible'].eq('Andy')]
    print(df)
                                 0
    Country Firm  Responsible     
    UK      Firm1 Andy         0.0
    US      Firm2 Peter        0.0
    FR      Firm1 Andy         0.0