python scikit-learn data-science sklearn-pandas one-hot-encoding

sklearn OneHotEncoder with ColumnTransformer returns a sparse matrix instead of creating dummy columns


I am trying to convert categorical values to integers using OneHotEncoder and ColumnTransformer. My understanding is that this should create dummy columns for the categorical columns, like pd.get_dummies does. My file has ~1500 records and 10 columns.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# df_model is the DataFrame loaded from my file
cat_features = ['COMPANY_NAME', 'BRAND_NAME']
enc = OneHotEncoder()

# One-hot encode the categorical columns and pass the other columns through unchanged
transformer = ColumnTransformer([("enc",
                                  enc,
                                  cat_features)],
                                  remainder="passthrough")
df_transformed = transformer.fit_transform(df_model)
df_transformed

The result is:

<1574x37 sparse matrix of type '<class 'numpy.float64'>'
    with 15513 stored elements in Compressed Sparse Row format>

When I try to look at the data after converting it into a DataFrame using:

[image]

What am I doing wrong? My data looks something like this:

[image]


Solution

  • Nothing is wrong with the encoding itself: the result is simply stored as a sparse matrix. You need to convert it to a dense array before putting it into a DataFrame (see the help page too):

    pd.DataFrame(df_transformed.toarray())
    

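    Or set the transformer to always return a dense array with the sparse_threshold option. By default (sparse_threshold=0.3) the stacked output is kept sparse when its overall density is below that value; sparse_threshold=0 always returns a dense array: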

    transformer = ColumnTransformer([("enc", 
                                      enc,
                                      cat_features)],
                                      remainder="passthrough",sparse_threshold=0)