Tags: python, pandas, dataframe, scikit-learn, sklearn-pandas

How can I convert the StandardScaler() transformation back to dataframe?


I'm working with a model, and after splitting the data into train and test sets, I want to apply StandardScaler(). However, this transformation converts my data into a NumPy array, and I want to keep the DataFrame format I had before. How can I do this?

Basically, I have:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = df[features]
y = df[["target"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

How can I get X_train_sc back to the format that X_train had?

Update: I don't want to undo the scaling. I just want X_train_sc to be a DataFrame, in the easiest possible way.


Solution

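  • As you mentioned, scaling returns a NumPy array. To get a DataFrame, build a new one from the result, reusing the column names and index of the input:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    sc = StandardScaler()
    # index=... keeps the row labels aligned with y_train / y_test
    X_train_sc = pd.DataFrame(
        sc.fit_transform(X_train), columns=X_train.columns, index=X_train.index
    )
    X_test_sc = pd.DataFrame(
        sc.transform(X_test), columns=X_test.columns, index=X_test.index
    )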
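
One detail worth checking when rebuilding the frame by hand: train_test_split shuffles the rows, so X_train keeps its original, non-contiguous index labels. The sketch below uses a small made-up DataFrame (the column names `a` and `b` and all values are hypothetical, standing in for df[features]) to show what passing index=X_train.index changes compared with leaving it out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for df[features] and df[["target"]]
df = pd.DataFrame({
    "a": np.arange(10, dtype=float),
    "b": np.arange(10, dtype=float) ** 2,
    "target": [0, 1] * 5,
})
X = df[["a", "b"]]
y = df[["target"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

sc = StandardScaler()
# Reuse both the column names and the row labels of the input frames
X_train_sc = pd.DataFrame(
    sc.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_sc = pd.DataFrame(
    sc.transform(X_test), columns=X_test.columns, index=X_test.index
)

# Without index=..., the new frame gets a fresh 0..n-1 RangeIndex instead
X_train_default = pd.DataFrame(sc.transform(X_train), columns=X_train.columns)
```

With the index carried over, operations that align on labels (for example joining X_train_sc with y_train) line up row for row. With the default RangeIndex they would not.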
    

    2022 Update

    As of scikit-learn 1.2.0, transformers can be configured to return pandas DataFrames through the set_output API (see the example in the scikit-learn documentation).

    The example above then simplifies to:

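    from sklearn.preprocessing import StandardScaler

    # pandas must be installed, but does not need to be imported here
    sc = StandardScaler().set_output(transform="pandas")
    X_train_sc = sc.fit_transform(X_train)  # DataFrame with X_train's columns and index
    X_test_sc = sc.transform(X_test)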