Search code examples
pythonpandasdataframemulti-index

Replace MultiIndex names for the nth position in all levels


I have about 300 csv files that I want to merge into a single one using pandas. All of them have 3 rows for variable names (project in the first one, Device_1 in the second one and variable in the third one). The first column goes something like ('Asset','Element','Date') and continues the timeseries values. Sometimes the names are in other languages, but there are 100 files that have their first row as ('ts',nan,nan), so I need to replace the first "column" of the MultiIndex names in order to be able to sort it by date after I finished merging the files.

df.iloc[:3,:5]
Out[16]: 
                    ts       Asset_1                                             
                   nan      Device_1                        Device_2     Device_3
                   nan     Variable_1         Variable_2  Variable_1    Variable_1
0  2022-12-31 00:00:00         0.0                NaN          0.0          0.0
1  2022-12-31 00:05:00         0.0                NaN          0.0          0.0
2  2022-12-31 00:10:00         0.0                NaN          0.0          0.0

So basically what I want to have is the first "column" of the multiindex names the same as my new_cols values

                 Asset       Asset_1                                             
                 Device      Device_1                        Device_2     Device_3
                 Date       Variable_1         Variable_2  Variable_1    Variable_1
0  2022-12-31 00:00:00         0.0                NaN          0.0          0.0
1  2022-12-31 00:05:00         0.0                NaN          0.0          0.0
2  2022-12-31 00:10:00         0.0                NaN          0.0          0.0

So far I've tried iterating through the three levels:

new_cols = ['Asset','Element','Date']
for i in range(3):
    df.rename(columns={df.columns[0][i]:new_cols[i]},inplace=True)

However, as this method doesn't replace by position and instead does it by name, as the second and third row have the same name (nan) the final names are ('Asset','Date','Date') and when I try to merge it with the other files the final dataframe has two datetime columns.

I tried using set_levels, but I have to put the verify_integrity to False because otherwise I receive an error:

new_cols = ['Asset','Element','Date']
for i in range(3):
updated_columns = [new_cols[i]] + list(df.columns.get_level_values(i)[1:])
df.columns.set_levels(updated_columns,level=i,verify_integrity=True)

ValueError: Level values must be unique: ['Asset', 'Asset_1','Asset_1','Asset_1','Asset_1'] on level 0

But then if I mark verify_integrity as False the final names get mixed up and I don't know why.

Is there a way to replace any name in a MultiIndex with a given position? Something like iloc, for example, something like

pd.Multiindex.iloc[0,2]='Date'

Solution

  • Convert MultiIndex to list of tuples and set new values:

    new_cols = ['Asset','Element','Date']
    
    L = df.columns.tolist()
    L[0] = tuple(new_cols)
    print (L)
    [('Asset', 'Element', 'Date'), 
     ('Asset_1', 'Device_1', 'Variable_1'), 
     ('Asset_1', 'Device_1', 'Variable_2'), 
     ('Asset_1', 'Device_2', 'Variable_1'), 
     ('Asset_1', 'Device_3', 'Variable_1')]
    
    df.columns = pd.MultiIndex.from_tuples(L)
    print (df)
                     Asset    Asset_1                                 
                   Element   Device_1              Device_2   Device_3
                      Date Variable_1 Variable_2 Variable_1 Variable_1
    0  2022-12-31 00:00:00        0.0        NaN        0.0        0.0
    1  2022-12-31 00:05:00        0.0        NaN        0.0        0.0
    2  2022-12-31 00:10:00        0.0        NaN        0.0        0.0
    

    Or you can use helper DataFrame is also possible, but slowier:

    new_cols = ['Asset','Element','Date']
    
    df1 = df.columns.to_frame()
    df1.iloc[0] = new_cols
    
    df.columns = pd.MultiIndex.from_frame(df1, names=df.columns.names)
    print (df)
                     Asset    Asset_1                                 
                   Element   Device_1              Device_2   Device_3
                      Date Variable_1 Variable_2 Variable_1 Variable_1
    0  2022-12-31 00:00:00        0.0        NaN        0.0        0.0
    1  2022-12-31 00:05:00        0.0        NaN        0.0        0.0
    2  2022-12-31 00:10:00        0.0        NaN        0.0        0.0