Search code examples
pythonpandasmulti-index

Pandas: partial indexing on multiIndex dataframe doesn't duplicate rows


I am trying to select rows (including repeats) from a two-level pandas MultiIndex dataframe, using .loc indexing, using a list of labels.

However, if I try this type of indexing with a MultiIndex dataframe, the order of rows of the output is the same as the input, and the repeated indices are ignored. Here is an example:

import numpy as np
import pandas as pd
import string as s

index1 = list(s.ascii_uppercase[:4])
index2 = np.arange(2)
col_names='col1 col2 col3'.split()

new_slices = list('DDAB') # note order and repition of labels

multi_index = pd.MultiIndex.from_product([index1,index2],names=["level0","level1"])

data = np.arange(len(index1)*len(index2)*len(col_names))
data=data.reshape(len(index1)*len(index2),-1)


df2 = pd.DataFrame(data,columns=col_names,index=multi_index)

print(df2.loc[new_slices])

               col1  col2  col3
level0 level1                  
A      0          0     1     2
       1          3     4     5
B      0          6     7     8
       1          9    10    11
D      0         18    19    20
       1         21    22    23

I would instead expect:

               col1  col2  col3
level0 level1                  
D      0         18    19    20
       1         21    22    23
D      0         18    19    20
       1         21    22    23
A      0          0     1     2
       1          3     4     5
B      0          6     7     8
       1          9    10    11

Is there a MultiIndex-specific function that I have missed? Or am I misunderstanding how the levels in the MultiIndex work?

(However, this works as I expect when selecting from a 'regular' dataframe, for example:)

import numpy as np
import pandas as pd
import string as s

index1 = list(s.ascii_uppercase[:4])
col_names='col1 col2 col3'.split()

new_slices = list('DDAB') # note order and repition of labels

data1 = np.arange(len(index1)*len(col_names)).reshape(len(index1),-1)
df1 = pd.DataFrame(data1,columns=col_names,index=index1)

print(df1)
print(df1.loc[new_slices])

which gives the result I would expect -- a dataframe with rows D,D,A,B.


Solution

  • You haven't missed anything. This is a result of the way pandas implements indexing. Partial indexing won't duplicate rows, only complete indexing.

    Partial: two of just the first level:

    df2.loc[['A', 'A']]
    
                   col1  col2  col3
    level0 level1                  
    A      0          0     1     2
           1          3     4     5
    

    Partial: two of just the second level:

    df2.loc[(slice(None), [0, 0]), :]
    
                   col1  col2  col3
    level0 level1                  
    A      0          0     1     2
    B      0          6     7     8
    C      0         12    13    14
    D      0         18    19    20
    

    However, complete indexing does duplicate rows:

    df2.loc[[('A', 0),('A', 0)]]
    
                   col1  col2  col3
    level0 level1                  
    A      0          0     1     2
           0          0     1     2
    

    This is the general case of the specific case that indexing on single indexes duplicates rows. All indexing on single indexes is complete, because there's only one level.

    If you want to get duplicate rows through indexing, and not by concating or joining, you can build a complete index list like this:

    complete = [(first, second) for first in new_slices for second in set(df2.index.get_level_values(1))]
    df2.loc[complete]
    
                   col1  col2  col3
    level0 level1                  
    D      0         18    19    20
           1         21    22    23
           0         18    19    20
           1         21    22    23
    A      0          0     1     2
           1          3     4     5
    B      0          6     7     8
           1          9    10    11