Search code examples
pythonpandasdataframemergemulti-index

Concatenate dataframes with initially mismatching index structures (columns)


I have one bigger dataframe and a small one which only has one row.

the bigger one

route            TC2_37               ...             TD25             
                 value    daily_change  ...         value    daily_change
period                                ...                              
Aug 23          20339.0       4018.0  ...          26569.0       -951.0
Sep 23          19737.0       3037.0  ...          32725.0       -507.0
Oct 23          19821.0       1316.0  ...          38033.0        -18.0
Nov 23          20803.0        580.0  ...          40282.0       -188.0
Dec 23          22070.0        115.0  ...          42195.0       -148.0
Q3 23           18158.0       1891.0  ...          31269.0      -1102.0
Q4 23           20899.0        672.0  ...          40170.0       -117.0
Q1 24           16361.0        363.0  ...          37983.0       -125.0
Q2 24           14581.0        380.0  ...          28731.0        546.0
Q3 24           13029.0        415.0  ...          27840.0        628.0
Q4 24           16701.0        310.0  ...          33390.0        520.0
Cal 24          15168.0        367.0  ...          31986.0        393.0
Cal 25          13950.0         98.0  ...          30712.0        139.0

some columns are not shown but they all have same structures

the small dataframe looks like this:

route         A6TCE   BCTI   BDTI   MA2TCE  ...     TD7      TD8      TD25    V2TCE
period                                      ...                                   
2023-08-02  17134.0  720.0  821.0  28859.0  ...  9917.0  31700.0  10408.0  11800.0

The small dataframe has more routes than the bigger one,

I wish to create a new dataframe which has the small dataframe as the first row, but only with the columns(routes) which overlaps. And only under the column "value", NOT "daily_change"

    route            TC2_37               ...             TD25             
                     value    daily_change  ...         value    daily_change
    period
2023-08-02           990.0                  ...        10408.0                                    
    Aug 23          20339.0       4018.0  ...          26569.0       -951.0
    Sep 23          19737.0       3037.0  ...          32725.0       -507.0
    Oct 23          19821.0       1316.0  ...          38033.0        -18.0
    Nov 23          20803.0        580.0  ...          40282.0       -188.0
    Dec 23          22070.0        115.0  ...          42195.0       -148.0
    Q3 23           18158.0       1891.0  ...          31269.0      -1102.0
    Q4 23           20899.0        672.0  ...          40170.0       -117.0
    Q1 24           16361.0        363.0  ...          37983.0       -125.0
    Q2 24           14581.0        380.0  ...          28731.0        546.0
    Q3 24           13029.0        415.0  ...          27840.0        628.0
    Q4 24           16701.0        310.0  ...          33390.0        520.0
    Cal 24          15168.0        367.0  ...          31986.0        393.0
    Cal 25          13950.0         98.0  ...          30712.0        139.0

Reproduce this part of the bigger dataframe from dict:

{('TC2_37', 'value'): {'Aug 23': 20339.0, 'Sep 23': 19737.0, 'Oct 23': 19821.0, 'Nov 23': 20803.0, 'Dec 23': 22070.0, 'Q3 23': 18158.0, 'Q4 23': 20899.0, 'Q1 24': 16361.0, 'Q2 24': 14581.0, 'Q3 24': 13029.0, 'Q4 24': 16701.0, 'Cal 24': 15168.0, 'Cal 25': 13950.0}, 
 ('TC2_37', 'daily_change'): {'Aug 23': 4018.0, 'Sep 23': 3037.0, 'Oct 23': 1316.0, 'Nov 23': 580.0, 'Dec 23': 115.0, 'Q3 23': 1891.0, 'Q4 23': 672.0, 'Q1 24': 363.0, 'Q2 24': 380.0, 'Q3 24': 415.0, 'Q4 24': 310.0, 'Cal 24': 367.0, 'Cal 25': 98.0}, 
 ('TD25', 'value'): {'Aug 23': 26569.0, 'Sep 23': 32725.0, 'Oct 23': 38033.0, 'Nov 23': 40282.0, 'Dec 23': 42195.0, 'Q3 23': 31269.0, 'Q4 23': 40170.0, 'Q1 24': 37983.0, 'Q2 24': 28731.0, 'Q3 24': 27840.0, 'Q4 24': 33390.0, 'Cal 24': 31986.0, 'Cal 25': 30712.0}, 
 ('TD25', 'daily_change'): {'Aug 23': -951.0, 'Sep 23': -507.0, 'Oct 23': -18.0, 'Nov 23': -188.0, 'Dec 23': -148.0, 'Q3 23': -1102.0, 'Q4 23': -117.0, 'Q1 24': -125.0, 'Q2 24': 546.0, 'Q3 24': 628.0, 'Q4 24': 520.0, 'Cal 24': 393.0, 'Cal 25': 139.0}}

Solution

  • The columns of the large dataframe are multi-indexed, whereas the small dataframe has them flat. A merge requires compatible columns. Therefore, either flatten the large, or multi-index the small one.

    Here, multi-indexing the small dataframe:

    Input data: "bigger" dataframe

    Periods    = ['Aug 23','Sep 23','Oct 23','Nov 23','Dec 23','Q3 23','Q4 23','Q1 24','Q2 24','Q3 24','Q4 24','Cal 24','Cal 25']
    Routes     = ['TC2_37', 'TD25']
    Categories = ['value','daily_change']
    Data = [[20339.0,4018.0,26569.0,-951.0],
            [19737.0,3037.0,32725.0,-507.0],
            [19821.0,1316.0,38033.0,-18.0],
            [20803.0,580.0,40282.0,-188.0],
            [22070.0,115.0,42195.0,-148.0],
            [18158.0,1891.0,31269.0,-1102.0],
            [20899.0,672.0,40170.0,-117.0],
            [16361.0,363.0,37983.0,-125.0],
            [14581.0,380.0,28731.0,546.0],
            [13029.0,415.0,27840.0,628.0],
            [16701.0,310.0,33390.0,520.0],
            [15168.0,367.0,31986.0,393.0],
            [13950.0,98.0,30712.0,139.0]]
    
    DF = pd.DataFrame(index   = pd.Index(Periods, name = 'periods'),
                      columns = pd.MultiIndex.from_product([Routes, Categories]),
                      data    = Data)
    DF
              TC2_37                  TD25             
               value daily_change    value daily_change
    periods                                            
    Aug 23   20339.0       4018.0  26569.0       -951.0
    Sep 23   19737.0       3037.0  32725.0       -507.0
    Oct 23   19821.0       1316.0  38033.0        -18.0
    Nov 23   20803.0        580.0  40282.0       -188.0
    Dec 23   22070.0        115.0  42195.0       -148.0
    Q3 23    18158.0       1891.0  31269.0      -1102.0
    Q4 23    20899.0        672.0  40170.0       -117.0
    Q1 24    16361.0        363.0  37983.0       -125.0
    Q2 24    14581.0        380.0  28731.0        546.0
    Q3 24    13029.0        415.0  27840.0        628.0
    Q4 24    16701.0        310.0  33390.0        520.0
    Cal 24   15168.0        367.0  31986.0        393.0
    Cal 25   13950.0         98.0  30712.0        139.0
    

    Input data: "small" dataframe

    routes   = ['A6TCE','BCTI','BDTI','MA2TCE','TD7', 'TD8',  'TD25', 'V2TCE']
    values   = [17134.0, 720.0, 821.0, 28859.0,9917.0,31700.0,10408.0,11800.0]
    category = ['value']
    period   = ['2023-08-02']
    

    (1) If you can create it directly with multi-indexed columns:

    df = pd.DataFrame(index   = pd.Index(period, name = 'periods'),
                      columns = pd.MultiIndex.from_product([routes, category]),
                      data    = [values])
    df
                  A6TCE   BCTI   BDTI   MA2TCE     TD7      TD8     TD25    V2TCE
                  value  value  value    value   value    value    value    value
    periods                                                                      
    2023-08-02  17134.0  720.0  821.0  28859.0  9917.0  31700.0  10408.0  11800.0
    

    (2) If you cannot create it directly with multi-indexed columns, then adding the level is simple: How to simply add a column level to a pandas dataframe

    df = pd.DataFrame(index   = pd.Index(period, name = 'periods'),
                      columns = routes,
                      data    = [values])
    df
                  A6TCE   BCTI   BDTI   MA2TCE     TD7      TD8     TD25    V2TCE
    periods                                                                      
    2023-08-02  17134.0  720.0  821.0  28859.0  9917.0  31700.0  10408.0  11800.0
    
    # Add the level:
    df.columns = pd.MultiIndex.from_product([df.columns, ['value']])
    df
                  A6TCE   BCTI   BDTI   MA2TCE     TD7      TD8     TD25    V2TCE
                  value  value  value    value   value    value    value    value
    periods                                                                      
    2023-08-02  17134.0  720.0  821.0  28859.0  9917.0  31700.0  10408.0  11800.0
    

    Merging

    Once index compatibility is there, merging is obvious:

    pd.concat([DF, df], axis=0, join='inner')
    

    Output: As requested,

    • only routes both dataframes had in common
    • and only the "value".
                   TD25
                  value
    periods            
    Aug 23      26569.0
    Sep 23      32725.0
    Oct 23      38033.0
    Nov 23      40282.0
    Dec 23      42195.0
    Q3 23       31269.0
    Q4 23       40170.0
    Q1 24       37983.0
    Q2 24       28731.0
    Q3 24       27840.0
    Q4 24       33390.0
    Cal 24      31986.0
    Cal 25      30712.0
    2023-08-02  10408.0
    

    As for sorting the index with new row on top, for that you will need to rephrase its contents in an homogeneous time unit, i.e. avoid mixing days, months or quarters, ideally using datetime format.