python, pandas, machine-learning, scikit-learn, statsmodels

More efficient way to mean-center a subset of columns in a pandas DataFrame and retain column names


I have a dataframe with about 370 columns. I'm testing a series of hypotheses that require me to fit a cubic regression model on different subsets of the columns. I'm planning on using statsmodels to model this data.

Part of the process for polynomial regression involves mean centering variables (subtracting the mean from every case for a particular feature).

I can do this in three lines of code, but it seems inefficient given that I need to repeat the process for half a dozen hypotheses. Keep in mind that I need the coefficient-level data from the statsmodels output, so I need to retain the column names.

Here's a peek at the data. It's the subset of columns I need for one of my hypothesis tests.

      i  we  you  shehe  they  ipron
0  0.51   0    0   0.26  0.00   1.02
1  1.24   0    0   0.00  0.00   1.66
2  0.00   0    0   0.00  0.72   1.45
3  0.00   0    0   0.00  0.00   0.53
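
The result shown in the solution below corresponds to this sample. If you want to run the snippets yourself, here is a minimal reconstruction of it, with the values copied straight from the table above:

import pandas as pd

# reconstruction of the sample shown above (stands in for the full dataframe)
df = pd.DataFrame({
    'i':     [0.51, 1.24, 0.00, 0.00],
    'we':    [0, 0, 0, 0],
    'you':   [0, 0, 0, 0],
    'shehe': [0.26, 0.00, 0.00, 0.00],
    'they':  [0.00, 0.00, 0.72, 0.00],
    'ipron': [1.02, 1.66, 1.45, 0.53],
})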

Here is the code that mean centers and keeps the column names.

import pandas as pd
from sklearn import preprocessing

#create df of features for this hypothesis, from the full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]

#center the variables. Note: with_mean/with_std expect booleans;
#the string 'False' is truthy and would standardize as well
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)

#scale() returns a NumPy array, so convert back into a pandas DataFrame
#and restore the column names
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns)

Any recommendations on how to make this more efficient / faster would be awesome!


Solution

  • df.apply(lambda x: x - x.mean())

    %timeit df.apply(lambda x: x - x.mean())
    1000 loops, best of 3: 2.09 ms per loop

  • df.subtract(df.mean())

    %timeit df.subtract(df.mean())
    1000 loops, best of 3: 902 µs per loop


    Both approaches yield:

            i  we  you  shehe  they  ipron
    0  0.0725   0    0  0.195 -0.18 -0.145
    1  0.8025   0    0 -0.065 -0.18  0.495
    2 -0.4375   0    0 -0.065  0.54  0.285
    3 -0.4375   0    0 -0.065 -0.18 -0.635
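
Applied back to the question, a minimal sketch (assuming df is the full dataframe from the question) that centers only the columns needed for one hypothesis while keeping their names for the statsmodels coefficients:

    # assumes df is the full dataframe; pick the columns for one hypothesis
    h2_cols = ['i', 'we', 'you', 'shehe', 'they', 'ipron']
    h2_centered = df[h2_cols].subtract(df[h2_cols].mean())  # stays a DataFrame, names preserved

Because the subtraction happens entirely in pandas, there is no round-trip through a NumPy array, so no separate step is needed to restore the column names.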