Search code examples
pythonpandas

How is this description understood in the pandas documentation?


https://pandas.pydata.org/docs/user_guide/merging.html#dataframe-join

When DataFrame are joined using only some of the levels of a MultiIndex, the extra levels will be dropped from the resulting join. To preserve those levels, use DataFrame.reset_index() on those level names to move those levels to columns prior to the join.

How is this passage to be understood?

Above, an example of a join when using a subset is mentioned, where index --- 'num' still exists and broadcasts, It struck me as strange. If the right index is not a subset of the left index, then it will report an error, this tried, even using on to specify the common index does not work

In [109]: leftindex = pd.MultiIndex.from_product(
   .....:     [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
   .....: )
   .....: 

In [110]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)

In [111]: left
Out[111]: 
            v1
abc xy num    
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

In [112]: rightindex = pd.MultiIndex.from_product(
   .....:     [list("abc"), list("xy")], names=["abc", "xy"]
   .....: )
   .....: 

In [113]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

In [114]: right
Out[114]: 
         v2
abc xy     
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

In [115]: left.join(right, on=["abc", "xy"], how="inner")
Out[115]: 
            v1   v2
abc xy num         
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

So, how is this passage to be understood?


Solution

  • This warning applies to merge not join.

    join

    If you don't provide on, there is an automatic alignment of levels with join on the common levels and no level is dropped.

    Let's assume a bit more complex example in which there are items present in different levels to better appreciate:

    leftindex = pd.MultiIndex.from_product(
        [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
    )
     
    
    left = pd.DataFrame({"v1": range(12)}, index=leftindex)
    
    rightindex = pd.MultiIndex.from_product(
         [list("abx"), list("cy"), list('z')], names=["xy", "abc", 'Z']
     )
     
    right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
    
    left.join(right, how='inner')
    #               v1   v2
    # abc xy num Z         
    # c   x  1   z   8  500
    #        2   z   9  500
    
    

    As you can see, the levels were correctly aligned on their name.

    If you manually pass on, now join will completely ignore the level names in right and take them in order:

    left.join(right, on=['xy', 'abc'], how='inner')
    # ValueError: len(left_on) must equal the number of levels in the index of "right"
    
    left.join(right.droplevel('Z'), on=['abc', 'xy'], how='inner')
    #             v1   v2
    # abc xy num         
    # a   y  1     2  200
    #        2     3  200
    # b   y  1     6  400
    #        2     7  400
    

    merge

    However, if you use merge, now the extra levels in right are dropped:

    left.merge(right, left_index=True, right_index=True)
    #             v1   v2
    # abc xy num         
    # c   x  1     8  500
    #        2     9  500
    

    And if you pass on, then only the common levels are kept:

    left.merge(right, on=["abc", "xy"], how="inner")
    
    #         v1   v2
    # abc xy         
    # c   x    8  500
    #     x    9  500