python pandas scikit-learn categorical-data one-hot-encoding

What is the difference between one-hot encoding and pandas.Categorical.codes?


I am working on a problem and have the following question.

In the data set there is a text column with the following unique values:

array(['1 bath', 'na', '1 shared bath', '1.5 baths', '1 private bath',
       '2 baths', '1.5 shared baths', '3 baths', 'Half-bath',
       '2 shared baths', '2.5 baths', '0 shared baths', '0 baths',
       '5 baths', 'Private half-bath', 'Shared half-bath', '4.5 baths',
       '5.5 baths', '2.5 shared baths', '3.5 baths', '15.5 baths',
       '6 baths', '4 baths', '3 shared baths', '4 shared baths',
       '3.5 shared baths', '6 shared baths', '6.5 shared baths',
       '6.5 baths', '4.5 shared baths', '7.5 baths', '5.5 shared baths',
       '7 baths', '8 shared baths', '5 shared baths', '8 baths',
       '10 baths', '7 shared baths'], dtype=object)

If I use CountVectorizer to convert them to a one-hot encoding,


from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X_train[colname].values)

I get the following error:


AttributeError: 'float' object has no attribute 'lower'


Please let me know the cause of the error.

Instead, can I use:

pd.Categorical(_DF_LISTING_EDA.bathrooms_text).codes

What is the difference between one-hot encoding and pd.Categorical(...).codes?

Thanks, Amit Modi


Solution

  • If you want one-hot encoding using pandas, you can use:

    pandas.get_dummies(X_train[colname])
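
To make the difference concrete, here is a minimal sketch on a hypothetical stand-in for the column. The Series `s` and its four values are assumptions for illustration, not the real data. It also shows the likely cause of the AttributeError: a missing value in a pandas object column is a float NaN, and CountVectorizer lowercases every entry, so a float in the input fails on `.lower()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bathrooms text column, with one missing value.
s = pd.Series(["1 bath", "2 baths", np.nan, "1 bath"], dtype=object)

# The missing entry is a float NaN, which has no .lower() method.
# This is the likely source of the error raised inside CountVectorizer.
missing = s.iloc[2]
is_float_nan = isinstance(missing, float) and not hasattr(missing, "lower")

# Integer codes: a single column holding one integer per category.
# Categories are sorted, and missing values get the code -1.
codes = pd.Categorical(s).codes

# One-hot encoding: one 0/1 column per category.
# By default a missing value produces a row of all zeros.
dummies = pd.get_dummies(s, dtype=int)

print(is_float_nan)
print(codes.tolist())
print(dummies)
```

The integer codes imply an ordering between categories (here '1 bath' < '2 baths' only because of alphabetical sorting), which many models will treat as meaningful. The one-hot columns avoid that at the cost of one column per category. If CountVectorizer is still wanted, filling the missing values first, for example with `X_train[colname].fillna("na")`, should avoid the error. Note, however, that CountVectorizer splits each string into word tokens and counts them, so it produces a bag-of-words encoding rather than one column per category.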