python-3.x machine-learning scikit-learn xgboost

Recovering features names of StandardScaler().fit_transform() with sklearn

Edited from a tutorial in Kaggle, I try to run the code below and data (available to download from here):

Code:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # for plotting facilities
from datetime import datetime, date
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import xgboost as xgb
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("./data/Aquifer_Petrignano.csv")

df['Date'] = pd.to_datetime(df.Date, format = '%d/%m/%Y')
df = df[df.Rainfall_Bastia_Umbra.notna()].reset_index(drop=True)

df = df.interpolate(method ='ffill')
df = df[['Date', 'Rainfall_Bastia_Umbra', 'Depth_to_Groundwater_P24', 'Depth_to_Groundwater_P25', 'Temperature_Bastia_Umbra', 'Temperature_Petrignano', 'Volume_C10_Petrignano', 'Hydrometry_Fiume_Chiascio_Petrignano']].resample('7D', on='Date').mean().reset_index(drop=False)

X = df.drop(['Depth_to_Groundwater_P24','Depth_to_Groundwater_P25','Date'], axis=1)
y1 = df.Depth_to_Groundwater_P24
y2 = df.Depth_to_Groundwater_P25

scaler = StandardScaler()
X = scaler.fit_transform(X)

model = xgb.XGBRegressor()
param_search = {'max_depth': range(1, 2, 2),
                'min_child_weight': range(1, 2, 2),
                'n_estimators' : [1000],
                'learning_rate' : [0.1]}

tscv = TimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv,
                        param_grid=param_search)
gsearch.fit(X, y1)

xgb_grid = xgb.XGBRegressor(**gsearch.best_params_)
xgb_grid.fit(X, y1)

ax = xgb.plot_importance(xgb_grid)
ax.figure.tight_layout()
ax.figure.savefig('test.png')

y_val = y1[-80:]
X_val = X[-80:]

y_pred = xgb_grid.predict(X_val)
print(mean_absolute_error(y_val, y_pred))
print(math.sqrt(mean_squared_error(y_val, y_pred)))

I plotted a features importance figure whose original features names are hidden:

If I comment out these two lines:

scaler = StandardScaler()
X = scaler.fit_transform(X)

I get the output:

How could I use scaler.fit_transform() for X and get a feature importance plot with the original feature names?

Solution

The reason behind this is that StandardScaler returns a numpy.ndarray of your feature values (same shape as pandas.DataFrame.values, but not normalized) and you need to convert it back to pandas.DataFrame with the same column names.

Here's the part of your code that needs changing.

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)