Tags: python, scikit-learn, one-hot-encoding

One-hot-encoding with missing categories


I have a dataset with a category column. In order to use linear regression, I 1-hot encode this column.

My set has 10 columns, including the category column. After dropping that column and appending the 1-hot encoded matrix, I end up with 14 columns (10 - 1 + 5).

So I train (fit) my LinearRegression model with a matrix of shape (n, 14).

After training it, I want to test it on a subset of the training set, so I take only the first 5 rows and put them through the same pipeline. But these 5 rows contain only 3 of the categories, so after going through the pipeline I'm left with a matrix of shape (5, 12) instead of (5, 14), because 2 categories are missing (10 - 1 + 3 = 12).

How can I force the one-hot encoder to use all 5 categories?

I'm using LabelBinarizer from sklearn.
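
To make the mismatch concrete, here is a minimal sketch that uses LabelBinarizer on its own rather than inside my pipeline. The category values and the tiny array are made up for illustration and are not from my dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical category column with 5 distinct values (illustration only).
# The first 5 entries deliberately contain only 3 of the 5 categories.
category = np.array(["a", "b", "c", "a", "b", "d", "e", "c"])

# Fitting on the full column yields one indicator column per category.
full_encoded = LabelBinarizer().fit_transform(category)

# Fitting a fresh encoder on the first 5 rows only sees 3 categories,
# so it yields 3 indicator columns instead of 5.
subset_encoded = LabelBinarizer().fit_transform(category[:5])

print(full_encoded.shape, subset_encoded.shape)
```

The two encoded matrices have different widths, which is the same kind of mismatch I get after appending them to the other columns.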


Solution

  • The mistake was in how I "put the test data through the same pipeline". Basically, I was doing:

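    from sklearn.linear_model import LinearRegression
    
    # Fit the preprocessing pipeline on the full training set
    data_prepared = full_pipeline.fit_transform(train_set)
    
    lin_reg = LinearRegression()
    lin_reg.fit(data_prepared, labels)
    
    # Take the first 5 rows and (wrongly) refit the pipeline on them
    some_data = train_set.iloc[:5]
    some_data_prepared = full_pipeline.fit_transform(some_data)
    
    lin_reg.predict(some_data_prepared)
    # => error because of mismatching shapes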
    

    The problematic line is:

    some_data_prepared = full_pipeline.fit_transform(some_data)
    

    Calling fit_transform refits the LabelBinarizer on a set containing only 3 of the 5 labels, so it produces only 3 columns. Instead, I should do:

    some_data_prepared = full_pipeline.transform(some_data)
    

    This way I'm using the pipeline as fitted on the full set (train_set), so the subset is transformed in the same way and keeps all 5 category columns.

    Thanks @Vivek Kumar
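
For anyone landing here, below is a self-contained sketch of the same fix. It uses LabelBinarizer directly instead of my full_pipeline, and the toy data and variable names (numeric, category, encoder) are assumptions made up for illustration. The point it shows is only this: fit the encoder once on the full training column, then call transform (not fit_transform) on the subset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data: one numeric feature plus a category column.
rng = np.random.default_rng(0)
numeric = rng.normal(size=(8, 1))
category = np.array(["a", "b", "c", "a", "b", "d", "e", "c"])
labels = rng.normal(size=8)

# Fit the encoder once, on the full training column (5 categories).
encoder = LabelBinarizer()
train_prepared = np.hstack([numeric, encoder.fit_transform(category)])

lin_reg = LinearRegression()
lin_reg.fit(train_prepared, labels)

# Reuse the fitted encoder on the subset: transform only, no refit.
# The first 5 rows contain 3 categories but still get 5 indicator columns.
some_prepared = np.hstack([numeric[:5], encoder.transform(category[:5])])
predictions = lin_reg.predict(some_prepared)

print(train_prepared.shape, some_prepared.shape, predictions.shape)
```

Because the encoder remembers the categories it saw during fit (in its classes_ attribute), the subset ends up with the same number of columns as the training matrix and predict no longer complains.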