python, scikit-learn, scikit-learn-pipeline

Value error using scikit-learn transformers


I am having trouble with a piece of code I am writing, specifically a pipeline. The data is a simple numerical dataframe (firewall logs), split into X_train and X_test in the usual way. After splitting, I devised a pipeline with three steps:

  1. ColumnTransformer(...some stuff going on here...)
  2. PCA(n_components=10)
  3. RandomForestClassifier()

Then I run the pipeline through a GridSearchCV(), fit() the grid search itself, and then fit the pipeline with the best parameters. The problem appears when I try to transform the test set with the fitted pipeline.

The pipeline I am using to transform the testing data is as follows:

test_pipe_transform = Pipeline(
    steps = [
        ('preprocessor', final_pipe.named_steps['preprocessor']),
        ('scaler'      , final_pipe.named_steps['PCA']),
    ])

I build this pipeline specifically to transform the test set using the fitted steps from the main pipeline, but it seems that I cannot transform my testing data with it. The error shown is:

self._check_n_features(X, reset=False)
  File "C:\Users\............\lib\site-packages\sklearn\base.py", line 359, in _check_n_features
    raise ValueError(
ValueError: X has 10 features, but ColumnTransformer is expecting 11 features as input.

What is happening here? Can somebody give me a hint about what might be going wrong?

The complete code below:

import warnings
warnings.filterwarnings('ignore')

# import dependencies
import pandas                as pd
import numpy                 as np
from typing                  import Any, List, Tuple

from sklearn.ensemble        import RandomForestClassifier
from sklearn.pipeline        import Pipeline
from sklearn.compose         import ColumnTransformer
from sklearn.impute          import SimpleImputer
from sklearn.preprocessing   import (
    OrdinalEncoder,
    MinMaxScaler,
    PowerTransformer,
    FunctionTransformer,
)   

# Classifier
from sklearn.metrics       import classification_report
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection   import (
    GridSearchCV,
    train_test_split,
    RandomizedSearchCV)


def get_categorical_columns(df):
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    return categorical_cols


def get_numerical_columns(df):
    numerical_cols = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            numerical_cols.append(col)
    return numerical_cols


if __name__ == '__main__':

    data = pd.read_csv(filepath_or_buffer=r'DATA\log2.csv')
    X = data.drop(['Action'], axis=1)
    y = data["Action"]
    
    X_train, X_test,\
        y_train, y_test = train_test_split(\
            X, 
            y,
            shuffle  = False,
            stratify = None, 
            test_size    = 0.5, 
            random_state = 0)
    

    categorical_features = get_categorical_columns(data)
    numeric_features     = get_numerical_columns(data)

    ####### BLOCK FOR NUMERIC IMPUTER OF MISSING VALUES ########
    numeric_imputer = \
        Pipeline(
            steps = [
                ("imputer", SimpleImputer(strategy = "median")),
                #("scaler" , StandardScaler())
            ])    
    ######### BLOCK FOR CATEGORICAL IMPUTER OF MISSING VALUES ##
    categorical_imputer = \
        Pipeline(
            steps = [
                ('imputer', SimpleImputer(
                    strategy   = 'constant',
                    fill_value = 'missing')),
                ('label_encoder', OrdinalEncoder()),
                #("selector", SelectPercentile(chi2, percentile = 50)),
            ])    

    ############# BLOCK FOR SCALING pkts_received ##############
    def log_transform(x):
        return np.log10(x+10)
    
    logtransformer = FunctionTransformer(log_transform    ,validate     = True)
    scaler         = PowerTransformer(method='yeo-johnson', standardize = True)
    scaler_2       = MinMaxScaler()
    pipe_pkt_received = \
        Pipeline(
            steps = [
                ('log1_transform' , logtransformer),
                ('scaler'         , scaler        ),
                ('min_max_scaler' , scaler_2      ),
            ])


    #####################  PREPROCESSOR ########################
    ############################################################
    ##   Applying Column transformer pipelines  ################

    preprocessor = ColumnTransformer(
        transformers = [
            ("Droping_Bytes_Received"       , "drop"             , ["Bytes Received"]  ),
            ("Droping_Bytes"                , "drop"             , ["Bytes"]           ),            
            ("Droping_Packets"              , "drop"             , ["Packets"]         ),
            ("num"                          , numeric_inputer    , numeric_features    ),
            ("pkt_received_scaling"         , pipe_pkt_received  , ["pkts_received"]   ),
            #("cat"                         , categorical_inputer, categorical_features),
        ],
        remainder = 'passthrough',
    )


    ############################################################
    ############################################################
    #####################  FINAL PIPELINE ######################
    ############################################################

    step_1 = ("preprocessor", preprocessor)
    step_2 = ("PCA"         , PCA(n_components = 10))
    step_3 = ("RNDF_clf"    , RandomForestClassifier())

    final_pipe = \
        Pipeline(
            steps = [
                step_1,
                step_2,
                step_3,
            ])

    param_grid  = {"PCA__n_components" : [5, 10],}
    grid_search = GridSearchCV(
        estimator  = final_pipe ,
        param_grid = param_grid , 
        cv         = 3          ,
        n_jobs     = -1         ,
        verbose    = 2          ,)
    grid_search.fit(X_train, y_train)

    # use best parameters to transform test data
    best_params = grid_search.best_params_
    final_pipe.set_params(**best_params)
    final_pipe.fit(X_train, y_train)
    
    test_pipe_transform = Pipeline(
        steps = [
            ('preprocessor', final_pipe.named_steps['preprocessor']),
            ('scaler'      , final_pipe.named_steps['PCA']),
        ])
    
    X_test_transformed = test_pipe_transform.transform(X_test)
    
    # evaluate model on test data using multiple metrics
    y_pred = final_pipe.predict(X_test_transformed)
    report = classification_report(y_test, y_pred)

Solution

  • final_pipe already contains your preprocessing steps, so final_pipe.predict applies them itself; you should not pass X_test_transformed to it. Doing so sends the data through the preprocessing twice: after the first pass PCA has reduced it to 10 columns, so the ColumnTransformer inside the pipeline receives 10 features where it expects the original 11, which is exactly the ValueError you see. Predict on the raw X_test instead.
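
    A minimal sketch of the corrected evaluation (final_pipe here is the pipeline fitted on the raw X_train, as in your code):

    # final_pipe already bundles preprocessing + PCA + classifier,
    # so give predict() the raw, untransformed test set:
    y_pred = final_pipe.predict(X_test)
    report = classification_report(y_test, y_pred)
    print(report)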


    Some other comments:

    1. You don't need to refit final_pipe with the best parameters from the hyperparameter search: that is done by default, since refit=True in the search. You can access the refitted pipeline as grid_search.best_estimator_.
    2. You don't need to rebuild test_pipe_transform so explicitly. You can slice pipelines: final_pipe[:-1] has all the steps except the last (all the preprocessing without the model) and can transform by itself. (If you follow (1), then final_pipe itself won't be fitted, having been cloned inside the search, but best_estimator_ will work; see the sketch below.)
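
    A short sketch combining both points (the slicing syntax requires scikit-learn >= 0.21; the step layout matches the pipeline defined above):

    # The search refits the best configuration on all of X_train by default:
    best_pipe = grid_search.best_estimator_

    # Slicing keeps every step except the final classifier, i.e. the
    # fitted preprocessor + PCA; the slice itself supports .transform():
    X_test_transformed = best_pipe[:-1].transform(X_test)

    # For predictions, call the full pipeline on the raw test data:
    y_pred = best_pipe.predict(X_test)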