
Recovering feature names after StandardScaler().fit_transform() with sklearn


I adapted the code below from a Kaggle tutorial and am running it on this data (available to download from here):

Code:

import seaborn as sns
import matplotlib.pyplot as plt  # for plotting facilities
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from datetime import datetime, date
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("./data/Aquifer_Petrignano.csv")

df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
df = df[df.Rainfall_Bastia_Umbra.notna()].reset_index(drop=True)

# Forward-fill the remaining gaps (interpolate(method='ffill') is deprecated in recent pandas)
df = df.ffill()
df = df[['Date', 'Rainfall_Bastia_Umbra', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25', 'Temperature_Bastia_Umbra', 'Temperature_Petrignano', 'Volume_C10_Petrignano', 'Hydrometry_Fiume_Chiascio_Petrignano']].resample('7D', on='Date').mean().reset_index(drop=False)

X = df.drop(['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25','Date'], axis=1)
y1 = df.Depth_to_Groundwater_P24
y2 = df.Depth_to_Groundwater_P25

scaler = StandardScaler()
X = scaler.fit_transform(X)

model = xgb.XGBRegressor()
param_search = {'max_depth': range(1, 2, 2),
                'min_child_weight': range(1, 2, 2),
                'n_estimators' : [1000],
                'learning_rate' : [0.1]}

tscv = TimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv,
                        param_grid=param_search)
gsearch.fit(X, y1)

xgb_grid = xgb.XGBRegressor(**gsearch.best_params_)
xgb_grid.fit(X, y1)

ax = xgb.plot_importance(xgb_grid)
ax.figure.tight_layout()
ax.figure.savefig('test.png')

y_val = y1[-80:]
X_val = X[-80:]

y_pred = xgb_grid.predict(X_val)
print(mean_absolute_error(y_val, y_pred))
print(math.sqrt(mean_squared_error(y_val, y_pred)))

The resulting feature importance plot does not show the original feature names:

[Image: feature importance plot without the original feature names]

If I comment out these two lines:

scaler = StandardScaler()
X = scaler.fit_transform(X)

I get the output:

[Image: feature importance plot with the original feature names]

How can I apply scaler.fit_transform() to X and still get a feature importance plot with the original feature names?


Solution

  • The feature names disappear because StandardScaler.fit_transform() returns a numpy.ndarray of the standardized feature values (same shape as X.values, but without column labels), so XGBoost has no names to show and falls back to generic ones such as f0, f1, and so on. Convert the result back to a pandas.DataFrame with the same column names.

    Here's the part of your code that needs changing.

    scaler = StandardScaler()
    X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
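
To see the effect in isolation, here is a minimal sketch on a small made-up DataFrame. The column names, values, and index are hypothetical stand-ins for the aquifer data, not taken from the question. It shows that fit_transform() drops the labels and that wrapping the result restores them; passing index=X.index as well is an optional extra that keeps the row labels aligned with y.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the aquifer features.
X = pd.DataFrame(
    {
        "rainfall": [0.0, 1.2, 3.4, 0.5],
        "temperature": [10.0, 12.5, 9.0, 15.0],
    },
    index=[10, 11, 12, 13],  # non-default index, to show that it is kept
)

scaler = StandardScaler()
scaled = scaler.fit_transform(X)  # plain numpy.ndarray, column names are gone
print(type(scaled).__name__)

# Wrap the array again, reusing the original column names and index.
X_scaled = pd.DataFrame(scaled, columns=X.columns, index=X.index)
print(list(X_scaled.columns))
```

If you are on scikit-learn 1.2 or later, calling scaler.set_output(transform="pandas") before fit_transform() should give you a DataFrame directly, which avoids the manual wrapping.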