python machine-learning scikit-learn data-science one-hot-encoding

ValueError "Shape mismatch: if categories is an array, it has to be of shape (n_features,)" is not resolved even after specifying the columns as indexes


    trf1=ColumnTransformer([("Infuse_val",SimpleImputer(strategy="mean"),[0])],remainder="passthrough")
    trf4=ColumnTransformer([("One_hot",OneHotEncoder(sparse=False,handle_unknown="ignore"),[1,4])],remainder="passthrough")
    trf2=ColumnTransformer([("Ord_encode",OrdinalEncoder(categories=["Strong","Mild"]),[3])],remainder="passthrough")
    trf3=ColumnTransformer([("scale",StandardScaler(),[0,2])],remainder="passthrough")
    pipe = Pipeline([
        ('trf1',trf1),
        ('trf2',trf2),
        ('trf3',trf3),
        ('trf4',trf4),
    ])
    pipe.fit(x_train,y_tarin)

Error

ValueError: Shape mismatch: if categories is an array, it has to be of shape (n_features,).

The table is shown in an image (not reproduced here).

I don't understand what the error in my code is.


Solution

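  • The error isn't about the column transformers; it's about the OrdinalEncoder. The categories parameter needs to be a list of lists: one inner list per encoded column, giving that column's categories in order. A flat list like ["Strong","Mild"] has length 2, so it is read as categories for two columns, which doesn't match the single column being encoded; hence the shape mismatch. Since you have just one column, categories=[["Strong","Mild"]] should work.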

    With just two categories, most subsequent algorithms won't care which one is encoded as 0 and which as 1, so here you could just leave categories at its default, "auto".

    Finally, you'll have problems with your chained column transformers. Each one changes the order (and names) of the columns, so by the time trf3 runs, columns 0 and 2 might no longer be the two numeric columns. The output order is predictable (the transformers' outputs in order, followed by the passthrough columns), so you could keep track of it manually. But I would suggest a single column transformer with multiple pipelines instead.
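
The single-transformer suggestion could look roughly like the sketch below. It is only an illustration under assumptions: the DataFrame and its column names are hypothetical stand-ins for the table in the question (assumed here to have numeric columns at positions 0 and 2, nominal columns at 1 and 4, and a "Strong"/"Mild" column at 3), and sparse_threshold=0 is an extra choice made here to force a dense result instead of relying on the encoder's sparse flag.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical stand-in for the table in the question (names and values are made up).
x_train = pd.DataFrame({
    "num_a": [25.0, np.nan, 40.0, 31.0],            # position 0: numeric, one missing value
    "cat_a": ["M", "F", "F", "M"],                  # position 1: nominal
    "num_b": [98.0, 101.0, 99.0, 100.0],            # position 2: numeric
    "level": ["Mild", "Strong", "Mild", "Strong"],  # position 3: ordinal
    "cat_b": ["X", "Y", "X", "Z"],                  # position 4: nominal
})

# Column 0 needs two steps (impute, then scale), so it gets its own small pipeline.
impute_then_scale = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Every transformer sees the ORIGINAL column positions, so there is no
# column reordering to keep track of between steps.
preprocess = ColumnTransformer(
    [
        ("num_imputed", impute_then_scale, [0]),
        ("num", StandardScaler(), [2]),
        # categories is a list of lists: one inner list per encoded column.
        ("ord_encode", OrdinalEncoder(categories=[["Strong", "Mild"]]), [3]),
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), [1, 4]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(x_train)
print(features.shape)
```

The output columns follow the order of the transformers in the list, so the ordinal-encoded column ends up at position 2 of the result, with "Strong" mapped to 0 and "Mild" to 1. The preprocess object can then be used as the first step of a Pipeline that ends in an estimator.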