I have managed to formulate a dataframe having ~200 columns each having a subcolumn. Now after some calculations, a separate set of values have been stored in the same dataframe with the same column names. Hence they are stored with a different subcolumn name. My query is how to group these two sets of (column, subcolumn) pairs together.
The following is a smaller subset of a similar representation (What I have):
Date Column1 Column2 Column3 Column1 Column2 Column3
Num Num Num ProdID ProdID ProdID
0 2015-01-02 100 106 111 AA AA AD
1 2015-01-02 105 118 109 AB AA AB
2 2015-01-02 102 173 188 AC AC AC
3 2015-01-02 112 101 105 AD AB AE
4 2015-01-02 139 191 156 AE AE AC
5 2015-01-02 126 172 117 AF AE AF
The following is what I want :
Date Column1 Column2 Column3
Num ProdID Num ProdID Num ProdID
0 2015-01-02 100 AA 106 AA 111 AD
1 2015-01-02 105 AB 118 AA 109 AB
2 2015-01-02 102 AC 173 AC 188 AC
3 2015-01-02 112 AD 101 AB 105 AE
4 2015-01-02 139 AE 191 AE 156 AC
5 2015-01-02 126 AF 172 AE 117 AF
I have tried the basic groupby statements like
df = df.groupby(df.columns, axis=1)
but that doesn't work. I think it's something easy but can't figure it out
We can try with reindex
:
df = df.reindex(columns=[df.columns[0], *sorted(df.columns[1:])])
df.columns[0]
is Date
which should not be sorted. Then the rest of the columns get sorted
df.columns[1:]
.
Sample Data
import numpy as np
import pandas as pd
rows = 3
df = pd.DataFrame(
np.arange(1, (rows * 5) + 1).reshape(-1, 5),
columns=pd.MultiIndex.from_arrays([
['Date', 'Column1', 'Column2', 'Column1', 'Column2'],
['', 'Num', 'Num', 'ProdID', 'ProdID']
])
)
df.iloc[:, 0] = pd.date_range(start='2021-01-01', periods=len(df), freq='D')
Before:
Date Column1 Column2 Column1 Column2
Num Num ProdID ProdID
0 2021-01-01 2 3 4 5
1 2021-01-02 7 8 9 10
2 2021-01-03 12 13 14 15
After reindex
:
Date Column1 Column2
Num ProdID Num ProdID
0 2021-01-01 2 4 3 5
1 2021-01-02 7 9 8 10
2 2021-01-03 12 14 13 15