I have about 300 csv files that I want to merge into a single one using pandas. All of them have 3 rows for variable names (project in the first one, Device_1 in the second one and variable in the third one). The first column goes something like ('Asset','Element','Date') and continues the timeseries values. Sometimes the names are in other languages, but there are 100 files that have their first row as ('ts',nan,nan), so I need to replace the first "column" of the MultiIndex names in order to be able to sort it by date after I finished merging the files.
df.iloc[:3,:5]
Out[16]:
ts Asset_1
nan Device_1 Device_2 Device_3
nan Variable_1 Variable_2 Variable_1 Variable_1
0 2022-12-31 00:00:00 0.0 NaN 0.0 0.0
1 2022-12-31 00:05:00 0.0 NaN 0.0 0.0
2 2022-12-31 00:10:00 0.0 NaN 0.0 0.0
So basically what I want to have is the first "column" of the multiindex names the same as my new_cols values
Asset Asset_1
Device Device_1 Device_2 Device_3
Date Variable_1 Variable_2 Variable_1 Variable_1
0 2022-12-31 00:00:00 0.0 NaN 0.0 0.0
1 2022-12-31 00:05:00 0.0 NaN 0.0 0.0
2 2022-12-31 00:10:00 0.0 NaN 0.0 0.0
So far I've tried iterating through the three levels:
new_cols = ['Asset','Element','Date']
for i in range(3):
df.rename(columns={df.columns[0][i]:new_cols[i]},inplace=True)
However, as this method doesn't replace by position and instead does it by name, as the second and third row have the same name (nan) the final names are ('Asset','Date','Date') and when I try to merge it with the other files the final dataframe has two datetime columns.
I tried using set_levels, but I have to put the verify_integrity to False because otherwise I receive an error:
new_cols = ['Asset','Element','Date']
for i in range(3):
updated_columns = [new_cols[i]] + list(df.columns.get_level_values(i)[1:])
df.columns.set_levels(updated_columns,level=i,verify_integrity=True)
ValueError: Level values must be unique: ['Asset', 'Asset_1','Asset_1','Asset_1','Asset_1'] on level 0
But then if I mark verify_integrity as False the final names get mixed up and I don't know why.
Is there a way to replace any name in a MultiIndex with a given position? Something like iloc, for example, something like
pd.Multiindex.iloc[0,2]='Date'
Convert MultiIndex to list of tuples and set new values:
new_cols = ['Asset','Element','Date']
L = df.columns.tolist()
L[0] = tuple(new_cols)
print (L)
[('Asset', 'Element', 'Date'),
('Asset_1', 'Device_1', 'Variable_1'),
('Asset_1', 'Device_1', 'Variable_2'),
('Asset_1', 'Device_2', 'Variable_1'),
('Asset_1', 'Device_3', 'Variable_1')]
df.columns = pd.MultiIndex.from_tuples(L)
print (df)
Asset Asset_1
Element Device_1 Device_2 Device_3
Date Variable_1 Variable_2 Variable_1 Variable_1
0 2022-12-31 00:00:00 0.0 NaN 0.0 0.0
1 2022-12-31 00:05:00 0.0 NaN 0.0 0.0
2 2022-12-31 00:10:00 0.0 NaN 0.0 0.0
Or you can use helper DataFrame is also possible, but slowier:
new_cols = ['Asset','Element','Date']
df1 = df.columns.to_frame()
df1.iloc[0] = new_cols
df.columns = pd.MultiIndex.from_frame(df1, names=df.columns.names)
print (df)
Asset Asset_1
Element Device_1 Device_2 Device_3
Date Variable_1 Variable_2 Variable_1 Variable_1
0 2022-12-31 00:00:00 0.0 NaN 0.0 0.0
1 2022-12-31 00:05:00 0.0 NaN 0.0 0.0
2 2022-12-31 00:10:00 0.0 NaN 0.0 0.0