https://pandas.pydata.org/docs/user_guide/merging.html#dataframe-join
When DataFrame are joined using only some of the levels of a MultiIndex, the extra levels will be dropped from the resulting join. To preserve those levels, use DataFrame.reset_index() on those level names to move those levels to columns prior to the join.
How is this passage to be understood?
Above, an example of a join when using a subset is mentioned, where index --- 'num' still exists and broadcasts, It struck me as strange. If the right index is not a subset of the left index, then it will report an error, this tried, even using on to specify the common index does not work
In [109]: leftindex = pd.MultiIndex.from_product(
.....: [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
.....: )
.....:
In [110]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)
In [111]: left
Out[111]:
v1
abc xy num
a x 1 0
2 1
y 1 2
2 3
b x 1 4
2 5
y 1 6
2 7
c x 1 8
2 9
y 1 10
2 11
In [112]: rightindex = pd.MultiIndex.from_product(
.....: [list("abc"), list("xy")], names=["abc", "xy"]
.....: )
.....:
In [113]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
In [114]: right
Out[114]:
v2
abc xy
a x 100
y 200
b x 300
y 400
c x 500
y 600
In [115]: left.join(right, on=["abc", "xy"], how="inner")
Out[115]:
v1 v2
abc xy num
a x 1 0 100
2 1 100
y 1 2 200
2 3 200
b x 1 4 300
2 5 300
y 1 6 400
2 7 400
c x 1 8 500
2 9 500
y 1 10 600
2 11 600
So, how is this passage to be understood?
This warning applies to merge
not join
.
join
If you don't provide on
, there is an automatic alignment of levels with join
on the common levels and no level is dropped.
Let's assume a bit more complex example in which there are items present in different levels to better appreciate:
leftindex = pd.MultiIndex.from_product(
[list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
)
left = pd.DataFrame({"v1": range(12)}, index=leftindex)
rightindex = pd.MultiIndex.from_product(
[list("abx"), list("cy"), list('z')], names=["xy", "abc", 'Z']
)
right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
left.join(right, how='inner')
# v1 v2
# abc xy num Z
# c x 1 z 8 500
# 2 z 9 500
As you can see, the levels were correctly aligned on their name.
If you manually pass on
, now join
will completely ignore the level names in right and take them in order:
left.join(right, on=['xy', 'abc'], how='inner')
# ValueError: len(left_on) must equal the number of levels in the index of "right"
left.join(right.droplevel('Z'), on=['abc', 'xy'], how='inner')
# v1 v2
# abc xy num
# a y 1 2 200
# 2 3 200
# b y 1 6 400
# 2 7 400
merge
However, if you use merge
, now the extra levels in right
are dropped:
left.merge(right, left_index=True, right_index=True)
# v1 v2
# abc xy num
# c x 1 8 500
# 2 9 500
And if you pass on
, then only the common levels are kept:
left.merge(right, on=["abc", "xy"], how="inner")
# v1 v2
# abc xy
# c x 8 500
# x 9 500