I have a dataset in the following shape:
type_of_bicycle_lane | month | other variables |
---|---|---|
'on the street' | 1 | ... |
'fully seperated' | 4 | ... |
'fully seperated' | 8 | ... |
'fully seperated' | 1 | ... |
'other_1' | 12 | ... |
'other_2' | 8 | ... |
I am trying to pass this through a scikit-learn pipeline, specifically through a OneHotEncoder, to use it for various ML models. However, for the type_of_bicycle_lane column I only want to create the one-hot encoding for the categories = ['on the street', 'fully seperated'].
Thus, I wrote the following code:
full_pipeline = ColumnTransformer([
("bicycle_lane", OneHotEncoder(categories = ['on the street', 'fully seperated']), ["type_of_bicycle_lane"]),
])
However, I keep receiving the Error: Shape mismatch: if categories is an array, it has to be of shape (n_features,).
I consulted the documentation, and as I understand it the parameter categories should be a list of array-likes where "categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values." (Link to documentation)
I have used this OneHotEncoder before, for example with
("month", OneHotEncoder(categories = [range(1,13)]), ["month"])
which worked perfectly fine.
What might I be doing wrong? Thank you very much in advance!
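categories should be a list of array-likes, one entry per encoded column. A flat list such as ['on the street', 'fully seperated'] is read as two entries for a single column, which is why the shape check fails; wrap it in an outer list instead. Additionally, you should set handle_unknown='ignore' or 'infrequent_if_exist' so that values outside the listed categories (such as 'other_1') do not raise an error.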
full_pipeline = ColumnTransformer([
("bicycle_lane", OneHotEncoder(categories = [['on the street', 'fully seperated']], handle_unknown='ignore'), ["type_of_bicycle_lane"]),
])
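To check this end to end, here is a minimal sketch that rebuilds the six example rows from the question as a pandas DataFrame (the "other variables" column is left out, and the `df` and `encoded` names are only for illustration). It combines the corrected bicycle-lane encoder with the month encoder and converts the result to a dense array so the rows are easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data mirroring the table in the question (assumed for illustration).
df = pd.DataFrame({
    "type_of_bicycle_lane": [
        "on the street", "fully seperated", "fully seperated",
        "fully seperated", "other_1", "other_2",
    ],
    "month": [1, 4, 8, 1, 12, 8],
})

full_pipeline = ColumnTransformer([
    # One inner list per encoded column; unknown values become all-zero rows.
    ("bicycle_lane",
     OneHotEncoder(categories=[["on the street", "fully seperated"]],
                   handle_unknown="ignore"),
     ["type_of_bicycle_lane"]),
    # Same pattern as the month encoder in the question.
    ("month",
     OneHotEncoder(categories=[list(range(1, 13))]),
     ["month"]),
])

encoded = full_pipeline.fit_transform(df)
# ColumnTransformer may return a sparse matrix; convert to dense for inspection.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

print(encoded.shape)   # 2 bicycle-lane columns + 12 month columns
print(encoded[:, :2])  # bicycle-lane part; 'other_1' and 'other_2' rows are all zeros
```

Rows whose lane type is not in the listed categories simply get zeros in both bicycle-lane columns, which is the behaviour the question asks for.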