Search code examples
pythonpandasdataframemulti-index

Pandas dataframe hierarchical index columns - Scale slice of dataframe


I have a pandas dataframe in which I have both continuous and discrete data. I'm separating the continous from the discrete data by using an hierarchical column index to process them in a different way.

Let's assume the following dataframe:

At first I'm building an hierarchical index, that I can use when I then initialize the empty dataframe. Then I'm filling the sample dataframe with continous and discrete data.

my_multiindex = pd.MultiIndex(levels=[['discrete', 'continuous'],[]],
                              labels=[[],[]],
                              names=['Datatype', 'Data']) 

df = pd.DataFrame(columns=my_multiindex,index=pd.date_range('20180101 9:01:00',freq='s',periods=50))

df.loc[:,('continuous', 'data1')] = np.random.uniform(-50, 100, 50)
df.loc[:,('continuous', 'data2')] = np.random.uniform(-500, 800, 50)
df.loc[:,('continuous', 'data3')] = np.random.uniform(-300, 200, 50)

df.loc[:,('discrete', 'data4')] = np.random.randint(0, 2, 50)
df.loc[:,('discrete', 'data5')] = np.random.randint(0, 2, 50)
df.loc[:,('discrete', 'data6')] = np.random.randint(0, 2, 50)

The resulting dataframe looks like this:

Datatype            continuous                         discrete            
Data                     data1       data2       data3    data4 data5 data6
2018-01-01 09:01:00  -6.406635  208.992257    1.845863        0     1     0
2018-01-01 09:01:01  80.084693  264.907285  190.628975        0     0     0
2018-01-01 09:01:02  74.734951  -39.456532 -132.749219        0     1     0
2018-01-01 09:01:03  55.703297  413.700692  147.826342        1     0     0
2018-01-01 09:01:04  25.797465 -290.116077 -250.018855        0     0     1

Then I want to use StandardScaler to scale the continous data in the dataframe. The discrete data I'm just trying to copy to the new dataframe df_scaled.

df_scaled = pd.DataFrame(columns=my_multiindex, index=df.index)
scaler = StandardScaler()
scaler.fit(df.loc[:,'continuous'])

df_scaled.loc[:,'discrete'] = df.loc[:,'discrete']
df_scaled.loc[:,'continuous'] = pd.DataFrame(scaler.transform(df.loc[:,'continuous']))

However, this is not working. The new dataframe df_scaled is still empty after running this. What am I doing wrong?


Solution

  • I'm not really sure why your approach is not working. I guess Multiindexes are still kind of experimental in pandas. However there is a simple workaraound:

    df_scaled = df.copy()
    df_scaled.loc[:,'continuous'] = scaler.transform(df.loc[:,'continuous'])
    

    If you want to know in more detail why your approach is not working maybe you should consider to post this on pandas github issue tracking and ask if this is a bug or feature.