I am having trouble with a pipeline I am writing. The data is a simple numerical dataframe (firewall logs), which is split into X_train and X_test in the usual way. After splitting, I build a pipeline with three steps: a ColumnTransformer preprocessor, PCA, and a RandomForestClassifier.
Then I run the pipeline through GridSearchCV, fit the grid search itself, and refit the pipeline with the best parameters. The problem appears when I try to transform the test set with the fitted pipeline.
The pipeline I am using to transform the test data is as follows:
test_pipe_transform = Pipeline(
    steps = [
        ('preprocessor', final_pipe.named_steps['preprocessor']),
        ('scaler'      , final_pipe.named_steps['PCA']),
    ])
I build this pipeline specifically to transform the test set using the fitted steps from the main pipeline. It seems that I cannot transform my test data with the fitted pipeline. The error shown is:
self._check_n_features(X, reset=False)
File "C:\Users\............\lib\site-packages\sklearn\base.py", line 359, in _check_n_features
raise ValueError(
ValueError: X has 10 features, but ColumnTransformer is expecting 11 features as input.
What is happening here? Can somebody give me a hint about what is going wrong?
The complete code is below:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
# import dependencies
from typing import Any, List, Tuple
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    OrdinalEncoder,
    MinMaxScaler,
    PowerTransformer,
    FunctionTransformer,
)
# Classifier
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import (
    GridSearchCV,
    train_test_split,
    RandomizedSearchCV)
def get_categorical_columns(df):
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    return categorical_cols

def get_numerical_columns(df):
    numerical_cols = []
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            numerical_cols.append(col)
    return numerical_cols
if __name__ == '__main__':
    data = pd.read_csv(filepath_or_buffer='DATA\log2.csv')
    X = data.drop(['Action'], axis=1)
    y = data["Action"]
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        shuffle = False,
        stratify = None,
        test_size = 0.5,
        random_state = 0)
    categorical_features = get_categorical_columns(data)
    numeric_features = get_numerical_columns(data)
    ####### BLOCK FOR NUMERIC INPUTER OF MISSING VALUES ########
    numeric_inputer = Pipeline(
        steps = [
            ("imputer", SimpleImputer(strategy = "median")),
            #("scaler" , StandardScaler())
        ])
    ########## BLOCK FOR CATEGORIAL INPUTER OF MISSING VALUES ##
    categorical_inputer = Pipeline(
        steps = [
            ('imputer', SimpleImputer(
                strategy = 'constant',
                fill_value = 'missing')),
            ('label_encoder', OrdinalEncoder()),
            #("selector", SelectPercentile(chi2, percentile = 50)),
        ])
    ############# BLOCK FOR SCALING pkts_received ##############
    def log_transform(x):
        return np.log10(x+10)
    logtransformer = FunctionTransformer(log_transform, validate = True)
    scaler = PowerTransformer(method='yeo-johnson', standardize = True)
    scaler_2 = MinMaxScaler()
    pipe_pkt_received = Pipeline(
        steps = [
            ('log1_transform' , logtransformer),
            ('scaler'         , scaler        ),
            ('min_max_scaler' , scaler_2      ),
        ])
    ##################### PREPROCESSOR ########################
    ############################################################
    ## Applying Column transformer pipelines ################
    preprocessor = ColumnTransformer(
        transformers = [
            ("Droping_Bytes_Received" , "drop"            , ["Bytes Received"] ),
            ("Droping_Bytes"          , "drop"            , ["Bytes"]          ),
            ("Droping_Packets"        , "drop"            , ["Packets"]        ),
            ("num"                    , numeric_inputer   , numeric_features   ),
            ("pkt_received_scaling"   , pipe_pkt_received , ["pkts_received"]  ),
            #("cat"                   , categorical_inputer, categorical_features),
        ],
        remainder = 'passthrough',
    )
    ############################################################
    ############################################################
    ##################### FINAL PIPELINE ######################
    ############################################################
    step_1 = ("preprocessor", preprocessor)
    step_2 = ("PCA"         , PCA(n_components = 10))
    step_3 = ("RNDF_clf"    , RandomForestClassifier())
    final_pipe = Pipeline(
        steps = [
            step_1,
            step_2,
            step_3,
        ])
    param_grid = {"PCA__n_components" : [5, 10],}
    grid_search = GridSearchCV(
        estimator  = final_pipe ,
        param_grid = param_grid ,
        cv         = 3          ,
        n_jobs     = -1         ,
        verbose    = 2          ,)
    grid_search.fit(X_train, y_train)
    # use best parameters to transform test data
    best_params = grid_search.best_params_
    final_pipe.set_params(**best_params)
    final_pipe.fit(X_train, y_train)
    test_pipe_transform = Pipeline(
        steps = [
            ('preprocessor', final_pipe.named_steps['preprocessor']),
            ('scaler'      , final_pipe.named_steps['PCA']),
        ])
    X_test_transformed = test_pipe_transform.transform(X_test)
    # evaluate model on test data using multiple metrics
    y_pred = final_pipe.predict(X_test_transformed)
    report = classification_report(y_test, y_pred)
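final_pipe contains your preprocessing steps, so final_pipe.predict performs those steps itself, and you should not pass X_test_transformed to that function. Call final_pipe.predict(X_test) on the untransformed test set instead. The error arises because X_test_transformed holds only the 10 PCA components, while the ColumnTransformer at the start of the pipeline was fitted on 11 input columns.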
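As a rough illustration of the point above, here is a minimal sketch on synthetic data. The 11 random columns named f0..f10, the toy target, and the single-imputer preprocessor are assumptions standing in for the firewall log and the question's ColumnTransformer; only the step names are taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the log data: 11 numeric columns (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 11)), columns=[f"f{i}" for i in range(11)])
y = (X["f0"] + X["f1"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Simplified preprocessor: median imputation on every column.
preprocessor = ColumnTransformer(
    transformers=[("num", SimpleImputer(strategy="median"), list(X.columns))],
    remainder="passthrough",
)
final_pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("PCA", PCA(n_components=10)),
        ("RNDF_clf", RandomForestClassifier(random_state=0)),
    ]
)
final_pipe.fit(X_train, y_train)

# Correct: the full pipeline preprocesses the raw test set on its own.
y_pred = final_pipe.predict(X_test)

# Incorrect: the data is transformed once here (10 PCA components) ...
X_test_transformed = final_pipe[:-1].transform(X_test)
try:
    # ... and the pipeline then tries to preprocess it a second time.
    final_pipe.predict(X_test_transformed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The second call fails because the first step of the pipeline receives an array that has already been reduced by PCA, which is the same kind of feature-count mismatch as in the traceback.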
Some other comments:
1. You don't need to refit final_pipe using the best parameters from the hyperparameter search: that's done by default, since refit=True in the search. You can access the refitted pipeline as grid_search.best_estimator_.
2. You don't need to define test_pipe_transform so explicitly. You can slice pipelines: final_pipe[:-1] has all the steps except the last (so all the preprocessing without the model), and can transform by itself. (If you follow (1) then final_pipe won't be fitted, having been cloned in the search, but the best_estimator_ will work.)
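Both comments can be sketched together as follows. This is only an illustration under stated assumptions: the data is synthetic (11 random columns and a toy target), and a StandardScaler stands in for the question's ColumnTransformer to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the log data (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 11)), columns=[f"f{i}" for i in range(11)])
y = (X["f0"] - X["f3"] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

final_pipe = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),  # stand-in for the ColumnTransformer
        ("PCA", PCA()),
        ("RNDF_clf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)
grid_search = GridSearchCV(
    estimator=final_pipe,
    param_grid={"PCA__n_components": [5, 10]},
    cv=3,
)
grid_search.fit(X_train, y_train)

# (1) refit=True is the default, so the best pipeline is already refitted
#     on the whole training set; no set_params/fit round trip is needed.
best_pipe = grid_search.best_estimator_
y_pred = best_pipe.predict(X_test)  # raw test data goes straight in
report = classification_report(y_test, y_pred)

# (2) Slicing off the last step leaves only the fitted preprocessing steps.
X_test_transformed = best_pipe[:-1].transform(X_test)
n_components = grid_search.best_params_["PCA__n_components"]

# The search works on clones, so the original final_pipe stays unfitted.
original_pca_is_fitted = hasattr(final_pipe.named_steps["PCA"], "components_")

print(X_test_transformed.shape, original_pca_is_fitted)
```

The transformed array is only needed if you want to inspect the PCA output; for predictions and the classification report, best_pipe.predict(X_test) is enough.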