Tags: python, machine-learning, deep-learning, numpy-ndarray, one-hot-encoding

Why does OneHotEncoder only work for up to 5 different categorical variable values?


I have noticed that OneHotEncoder fails when a categorical variable column has 6 or more categories. For instance, I have a TestData.csv file that has two columns: Geography and Continent. Geography's distinct values are France, Spain, Kenya, Botswana, and Nigeria, while Continent's distinct values are Europe and Africa. My goal is to encode the Geography column using OneHotEncoder. I use the following code to do this:

import numpy as np
import pandas as pd

#Importing the dataset
dataset = pd.read_csv('TestData.csv')
X = dataset.iloc[:,:].values #X is hence a 2-dimensional numpy.ndarray

#Encoding categorical column Geography
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') #the 0 is the column index for the categorical column we want to encode, in this case Geography
X = np.array(ct.fit_transform(X))

I then print(X) to make sure I get the expected output, which I do: the Geography column is replaced by five one-hot columns (note the size of X).

However, if I add one new country to the TestData file, say Belgium, we now have 6 distinct countries, and running the exact same code no longer works.

It fails at the line

X = np.array(ct.fit_transform(X))

X is left unchanged and no encoding is done. I have tested this multiple times, so it seems as though OneHotEncoder can only handle up to 5 different category values. Is there a parameter I can change, or another method I can use, to encode categorical variables with more than 5 values?

PS - I know to remove the dummy variable after the encoding ;)

I am running Python 3.7.7

Thanks!


Solution

  • I think the issue is the sparse_threshold parameter of ColumnTransformer. Try setting it to 0 so the output is always a dense numpy array. The density of your output is falling below 0.3 (the default threshold), which prompts ColumnTransformer to switch to a sparse matrix, but the output still contains the string column Continent, and sparse matrices can't hold strings. With 5 countries the output has 6 columns and 2 non-zero values per row (density 2/6 ≈ 0.33, above the threshold); adding Belgium makes it 7 columns (density 2/7 ≈ 0.29, below it), which is why the failure shows up at exactly 6 categories.
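
    A minimal sketch of what that looks like, reusing the code from the question (same TestData.csv, Geography at column index 0):

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    #Importing the dataset
    dataset = pd.read_csv('TestData.csv')
    X = dataset.iloc[:,:].values

    #Encoding categorical column Geography; sparse_threshold=0 forces a dense
    #output even when the one-hot columns make the result mostly zeros, so the
    #string column Continent can still be stacked alongside them
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), [0])],
        remainder='passthrough',
        sparse_threshold=0)
    X = np.array(ct.fit_transform(X))
    print(X)

    Depending on your scikit-learn version, OneHotEncoder(sparse=False) (renamed to sparse_output in newer releases) should have a similar effect, since the encoder then never produces a sparse matrix in the first place.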