Search code examples
pythonpandasdataframejoinmulti-index

Join dataframes on a mix of multiindex columns and data columns


segments = pd.DataFrame({
    'shipment_id': [1, 1, 1, 2, 2],
    'segment_order': [1, 2, 3, 1, 2],
    'container_id': [None, 'aa', 'bb', 'cc', None]
}).set_index(['shipment_id', 'segment_order'])
print(segments)

containers = pd.DataFrame({
    'shipment_id': [1, 1, 2],
    'container_id': ['aa', 'bb', 'cc'],
    'temperature_max': [24.6, 25.3, 31.8],
}).set_index(['shipment_id', 'container_id'])
print(containers)
                          container_id
shipment_id segment_order             
1           1                     None
            2                       aa
            3                       bb
2           1                       cc
            2                     None

                          temperature_max
shipment_id container_id                 
1           aa                       24.6
            bb                       25.3
2           cc                       31.8

I would like to join the containers['temperature_max'] column to the segments data frame. Note that the join condition involves a mix of regular data columns and multi-index columns.

What is the preferred way to perform this join?

The expected output is:

                          container_id temperature_max
shipment_id segment_order             
1           1                     None             NaN
            2                       aa            24.6
            3                       bb            25.3
2           1                       cc            31.8
            2                     None             NaN

Solution

  • While merge can work with index name, it will happily erase old index and create new ones (makes sense). For your data, you can temporary unset segment_order index level, merge, then set segment_order back as index:

    (segments.reset_index(level='segment_order')
        .merge(containers, on=['shipment_id', 'container_id'], how='left')
        .set_index('segment_order', append=True)
    )
    

    Output:

                              container_id  temperature_max
    shipment_id segment_order                              
    1           1                     None              NaN
                2                       aa             24.6
                3                       bb             25.3
    2           1                       cc             31.8
                2                     None              NaN