python · machine-learning · scikit-learn · grid-search

What may be raising the error: ValueError: X has 23 features, but SVR is expecting 24 features as input?


tl;dr: I have a pipeline that does not work as expected. I am probably making some mistake with OneHotEncoder. The question is a little long because of the code, but it is probably very simple and straightforward. I can provide the full code and data upon request, but I don't think that will be needed.

The question:

I have a transformation pipeline that goes as follows:

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.svm import SVR

#from sklearn import set_config
#set_config(transform_output='pandas')

class Preprocessor(TransformerMixin):
    def __init__(self):
        self._cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
        #pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        #self._cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
        preprocessing = self._preprocessing()
        return preprocessing.fit_transform(X)

    def _column_ratio(self, X):
        return X[:, [0]] / X[:, [1]]

    def _ratio_name(self, function_transformer, feature_names_in):
        return ["ratio"]

    def _ratio_pipeline(self):
        return make_pipeline(
            SimpleImputer(strategy="median"),
            FunctionTransformer(self._column_ratio, feature_names_out=self._ratio_name),
            StandardScaler()
        )

    def _log_pipeline(self):
        return make_pipeline(
            SimpleImputer(strategy="median"),
            FunctionTransformer(np.log, feature_names_out="one-to-one"),
            StandardScaler()
        )

    def _cat_pipeline(self):
        return make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore")
        )

    def _default_num_pipeline(self):
        return make_pipeline(SimpleImputer(strategy="median"),
                             StandardScaler()
        )

    def _preprocessing(self):
        return ColumnTransformer([
            ("bedrooms", self._ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
            ("rooms_per_house", self._ratio_pipeline(), ["total_rooms", "households"]),
            ("people_per_house", self._ratio_pipeline(), ["population", "households"]),
            ("log", self._log_pipeline(), ["total_bedrooms", "total_rooms", "population",
                                "households", "median_income"]),
            ("geo", self._cluster_simil, ["latitude", "longitude"]),
            ("cat", self._cat_pipeline(), make_column_selector(dtype_include=object)),
        ], remainder=self._default_num_pipeline())  # one column remaining: housing_median_age

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

But it is not working properly. When I call, just as a test:

preprocessor = Preprocessor()
X_train = preprocessor.fit_transform(housing)

the output of X_train.info() is exactly what it should be. But when I try a grid search with:

svr_pipeline = Pipeline([("preprocessing", preprocessor), ("svr", SVR())])
grid_search = GridSearchCV(svr_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
                       
grid_search.fit(housing.iloc[:5000], housing_labels.iloc[:5000])

it outputs the following error and warning:

ValueError: The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:
- cat__ocean_proximity_ISLAND

UserWarning: One or more of the test scores are non-finite:
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan]

The problem is probably here:

ValueError: X has 23 features, but SVR is expecting 24 features as input.

The shape of X after passing through the pipeline is (16521, 24), which is exactly the shape expected after the transformation.

That is, I have 24 features after the transformation pipeline, but somehow, when the Preprocessor class is called inside the grid search, SVR only sees 23 of them. The missing one is ocean_proximity_ISLAND, which has only a few occurrences in the dataset. That is why running the grid search on only the first 100 or 1000 rows of the dataset gives no problems, but running it on enough rows for ocean_proximity_ISLAND to appear raises this error.
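
For reference, a quick way to check how rare that category is (just an illustrative check, assuming the raw column is named ocean_proximity):

housing['ocean_proximity'].value_counts()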

This warning repeats at every step of the grid search, and the column it points to is always the same one, cat__ocean_proximity_ISLAND, which comes from the _cat_pipeline part of the pipeline and is a result of using OneHotEncoder.

Is the problem really in OneHotEncoder? If so, why does it only raise an error during the grid search? How can I fix it and avoid such an error in the future?


Solution

  • It could be that there are just a few samples with ocean_proximity='ISLAND', such that on some splits it doesn't appear in the training set and only appears in the validation set. This would lead to an error condition, as the encoder is seeing categories in the validation set that it was not fit on at train time.

    One way round this is to tell OneHotEncoder in advance what all the possible categories are, using the categories= parameter. This is different from the default configuration (categories='auto'), where it figures out the categories from the training data it sees. Note that categories expects one list of raw category values per categorical column, not the prefixed feature names that the ColumnTransformer produces. You might use something like:

    OneHotEncoder(
        ..., categories=[['<1H OCEAN', 'INLAND', 'ISLAND',
                          'NEAR BAY', 'NEAR OCEAN']]
    )
    

    This way, even if a category doesn't show up in the training split, the encoder can still handle it properly when it comes up in the validation split.
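
    As a rough sketch of how this could slot into the question's _cat_pipeline (an assumption on my part, not tested against the full code: it presumes ocean_proximity is the only object-dtype column and that the five categories above are the complete set):

    def _cat_pipeline(self):
        # Fix the full category list up front so every CV split yields the same
        # set of one-hot columns, even when 'ISLAND' is absent from that split.
        ocean_proximity_categories = [[
            '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'
        ]]
        return make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OneHotEncoder(handle_unknown="ignore",
                          categories=ocean_proximity_categories)
        )

    With the category list pinned in advance, the encoder emits a column for every category on every split, so the number of features SVR receives no longer depends on which rows land in each fold.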