Tags: machine-learning, scikit-learn, data-science, sklearn-pandas

How to predict new values when a machine learning model was trained on data standardized with StandardScaler


I'm working on a machine learning model and have a dataframe with the data.

I standardize the data (zero mean, unit variance) with StandardScaler:

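import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)  # fit_transform returns a NumPy array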

I split the dataset into features and target:

X_df = df[X_characteristics_list]
y_df = df[target]

I split into train and test sets, then train the model:

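from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.25)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)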

I predict on the test set to validate the model's effectiveness:

from sklearn.metrics import mean_squared_error
y_test_pred = forest.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)

But when it is time to use the model in real life, I need to leave it ready to predict.

If I want to predict just one record, say [100, 20, 34], I can't, because the record needs to be standardized first. Transforming it with a new StandardScaler does not work, because the result depends on the mean and standard deviation of the data, so I would need the original dataset.

What's the best way to solve this problem?


Solution

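  • Fit the scaler on the training data only, then call transform (not fit_transform) on anything you predict on later. The fitted scaler stores the training mean and standard deviation, so the original dataset is not needed: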

    >>> from sklearn.datasets import make_classification
    >>> from sklearn.model_selection import train_test_split
    >>> from sklearn.linear_model import LogisticRegression
    >>> from sklearn.preprocessing import StandardScaler
    # Create our input and output matrices
    >>> X, y = make_classification()
    # Split train-test... "test" will be production/unobserved/"real-life" data
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y)
    # What does X_train look like?
    >>> X_train
    array([[-0.08930702, -2.71113991, -0.93849926, ...,  0.21650905,
             0.68952722,  0.61365789],
           [-0.31143977, -1.87817904,  0.08287492, ..., -0.41332943,
            -0.58967179,  1.7239411 ],
           [-1.62287589,  1.10691318, -0.630556  , ..., -0.35060008,
             1.11270562,  0.08106694],
           ...,
           [-0.59797041,  0.90218081,  0.89983074, ..., -0.54374315,
             1.18534841, -0.03397969],
           [-1.2006559 ,  1.01890955, -1.21617181, ...,  1.76263322,
             1.38280423, -1.0192972 ],
           [ 0.11883425,  1.42952643, -1.23647358, ...,  1.02509208,
            -1.14308885,  0.72096531]])
    # Let's scale it
    >>> scaler = StandardScaler()
    >>> X_train = scaler.fit_transform(X_train)
    >>> X_train
    array([[ 0.08867642, -1.97950269, -1.1214106 , ...,  0.22075623,
             0.57844552,  0.46487917],
           [-0.10736984, -1.34896243,  0.00808597, ..., -0.37670234,
            -0.6045418 ,  1.57819736],
           [-1.26479555,  0.91071257, -0.78086855, ..., -0.3171979 ,
             0.96979563, -0.06916763],
           ...,
           [-0.36025134,  0.7557329 ,  0.91152449, ..., -0.50041152,
             1.03697478, -0.18452874],
           [-0.89215959,  0.84409499, -1.42847749, ...,  1.68739437,
             1.21957946, -1.17253964],
           [ 0.27237431,  1.15492649, -1.4509284 , ...,  0.98777012,
            -1.116335  ,  0.57247992]])
    # Fit the model
    >>> model = LogisticRegression()
    >>> model.fit(X_train, y_train)
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='auto', n_jobs=None, penalty='l2',
                       random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                       warm_start=False)
    # Now let's use the already-fitted StandardScaler object to simply transform
    # *not fit_transform* the test data
    >>> X_test = scaler.transform(X_test)
    >>> model.predict(X_test)
    array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
           0, 0, 0])
    

    Note that you can save the fitted scaler object with joblib or pickle and reload it later to scale new records at prediction time.
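As a rough sketch of that last point applied to the question's setup, the snippet below fits the scaler on the training features, saves both the scaler and the model with joblib, reloads them, and predicts the single record [100, 20, 34]. The synthetic data, the file names, and the temporary directory are assumptions made only for illustration. The target is also left unscaled here, so predictions come back in the original units. That differs from the question, where the whole dataframe was scaled.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 200 rows with 3 raw (unscaled) features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training features only, then train on the scaled data.
scaler = StandardScaler().fit(X_train)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(scaler.transform(X_train), y_train)

# Persist both fitted objects (paths are placeholders).
out_dir = tempfile.mkdtemp()
scaler_path = os.path.join(out_dir, "scaler.joblib")
model_path = os.path.join(out_dir, "forest.joblib")
joblib.dump(scaler, scaler_path)
joblib.dump(forest, model_path)

# Later, for example in another process: reload and predict one raw record.
loaded_scaler = joblib.load(scaler_path)
loaded_forest = joblib.load(model_path)

new_record = np.array([[100, 20, 34]])  # 2D array: one row, three features
new_scaled = loaded_scaler.transform(new_record)  # uses the stored mean_ and scale_
prediction = loaded_forest.predict(new_scaled)
print(prediction)
```

The reloaded scaler carries the training statistics in its mean_ and scale_ attributes, so the original dataset does not have to be available at prediction time. Note that transform expects a 2D array, which is why the single record is wrapped in an extra pair of brackets.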