Search code examples
pythonpandasmulti-index

Combining Excel Files with MultiIndexed Columns in Pandas


I was processing some raw data and created 3 separate dataframes that I exported to excel. One of these dataframes is a regular column with rows, the other two are multiindexed columns with rows. I want to combine all three columns into a single dataframe (columns side by side). I've tried concatenating on axis=1 (columns) and the returned dataframe lost some column names (due to different sized indexes). I cannot use merge as these are three separate dataframes with no shared columns.

Below are samples of what these 3 dataframes look like separately. 

df1 = | A        | B        |
      |----------|----------|
      | 10       | 11       |
      | 20       | 21       |


df2=  |          X          |
      |---------------------| 
      | C        | D        |
      |----------|----------|
      | 5        | 6        |
      | 30       | 9        |


df3 = |          Y          |
      |---------------------| 
      | J        | K        |
      |--------- |--------- |
      |Q         | R        |
      |----------|----------|
      | 10       | 11       |
      | 20       | 21       |


I would like to be able to have the above side by side and export it in a single dataframe. Any advice?

As mentioned earlier - I tried pd.concatenate and it deleted some of the column names from the multiindex dataframes and merge is not possible due to not having a shared column. 

Solution

  • Option 1: Prepend MultiIndex Levels

    You could prepend empty MultiIndex levels to df1 and df2 so that the levels match across all data frames.

    # Prepare session
    import pandas as pd
    
    # Recreate data
    df1 = pd.DataFrame({'A':[10, 20], 'B':[11, 21]})
    df2 = pd.DataFrame({'C': [5, 30], 'D': [6, 9]})
    df2.columns = pd.MultiIndex.from_arrays([('X', 'X'), ('C', 'D')])
    df3 = pd.DataFrame({'Q': [10, 20], 'R': [11, 21]})
    df3.columns = pd.MultiIndex.from_arrays([('Y', 'Y'), ('J', 'K'), ('Q', 'R')])
    
    # Adding empty multi-indices to match df3
    df1 = pd.concat([df1], keys=[('', '')], axis=1)
    df2 = pd.concat([df2], keys=[('')], axis=1)
    
    # Combine data frames
    df = pd.concat([df1, df2, df3], axis=1)
    

    Option 2: Drop MultiIndex Levels

    You can drop all but the last MultiIndex levels.

    # Drop Multiindex Levels
    df2.columns = df2.columns.get_level_values(-1)
    df3.columns = df3.columns.get_level_values(-1)
    
    # Combine data frame
    df = pd.concat([df1, df2, df3], axis=1)
    

    Option 3: Unflatten Column Index

    I believe, pandas flattens the mismatching MultiIndex levels into one level (e.g., ['A', 'B', ('X', 'C'), ('X', 'D'), ('Y', 'J', 'Q'), ('Y', 'K', 'R')] which could be reversed.

    # Combine data frames
    df = pd.concat([df1, df2, df3], axis=1)
    
    # Unflatten Column Indices
    dfc = pd.DataFrame.from_records(list(df.columns)).T
    l = [list(dfc[c].sort_values(key=lambda x: x.isna(), ascending=False)) for c in dfc]
    a = pd.DataFrame(l).fillna('').T.to_numpy()
    df.columns = pd.MultiIndex.from_arrays(a)