Tags: python, machine-learning, scikit-learn, one-hot-encoding, countvectorizer

Working With Column transformation for CountVectorizer and OneHotEncoder in sklearn


I have a dummy DataFrame with two columns, text and vehicle. I want to use CountVectorizer for the text column and OneHotEncoder for the vehicle column.

import pandas as pd 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

df  = pd.DataFrame([['how are you','car'],['good mrng have a nice day','bike'],['today is my best working day','cycle'],['hello','bike']], columns = ['text','vehicle']) 


preprocess = make_column_transformer((CountVectorizer(), ['text']),(OneHotEncoder(), ['vehicle']))
preprocess.fit_transform(df)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-d7644861c938> in <module>()
----> 1 preprocess.fit_transform(df)

~\AppData\Roaming\Python\Python36\site-packages\sklearn\compose\_column_transformer.py in 
fit_transform(self, X, y)
469         self._validate_output(Xs)
470 
--> 471         return self._hstack(list(Xs))
472 
473     def transform(self, X):

~\AppData\Roaming\Python\Python36\site-packages\sklearn\compose\_column_transformer.py in 
_hstack(self, Xs)
526         else:
527             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 528             return np.hstack(Xs)
529 
530 

C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\shape_base.py in hstack(tup)
338         return _nx.concatenate(arrs, 0)
339     else:
--> 340         return _nx.concatenate(arrs, 1)
341 
342 

ValueError: all the input array dimensions except for the concatenation axis must match exactly

The error seems to occur because the two transformers produce outputs with different shapes. Applied separately, each one works:

vect  = CountVectorizer()
vect.fit_transform(df['text'])
# output
<4x14 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in Compressed Sparse Row format>

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit_transform(df['vehicle'].to_numpy().reshape(-1, 1)).toarray()

# output
array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

How can I apply .to_numpy().reshape(-1, 1) inside the column transformer, or is there another way to achieve this?


Solution

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import make_column_transformer
    from sklearn.feature_extraction.text import CountVectorizer
    
    df  = pd.DataFrame([['how are you','car'],['good mrng have a nice day','bike'],['today is my best working day','cycle'],['hello','bike']], columns = ['text','vehicle']) 
    

    The change is here:

    preprocess = make_column_transformer((CountVectorizer(), 'text'),(OneHotEncoder(), ['vehicle']))
    

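    Pass 'text' as a plain string instead of wrapping it in a list. A string selector hands the transformer a 1-D Series, which is what CountVectorizer expects (an iterable of documents). A list selector hands it a 2-D DataFrame, and iterating over a DataFrame yields its column names, so the vectorizer sees a single document instead of four and the row counts no longer match. OneHotEncoder expects 2-D input, so ['vehicle'] stays in a list.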
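To see the difference between the two selector forms, here is a minimal sketch that reuses the DataFrame from the question. The variable names rows_from_2d, rows_from_1d and features are only illustrative, and the exact behaviour may vary slightly between scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [
        ["how are you", "car"],
        ["good mrng have a nice day", "bike"],
        ["today is my best working day", "cycle"],
        ["hello", "bike"],
    ],
    columns=["text", "vehicle"],
)

# A list selector gives the transformer a 2-D DataFrame. Iterating over a
# DataFrame yields its column names, so the vectorizer sees one document.
rows_from_2d = CountVectorizer().fit_transform(df[["text"]]).shape[0]

# A string selector gives the transformer a 1-D Series: one document per row.
rows_from_1d = CountVectorizer().fit_transform(df["text"]).shape[0]

# String selector for the text column, list selector for the categorical one.
preprocess = make_column_transformer(
    (CountVectorizer(), "text"),
    (OneHotEncoder(handle_unknown="ignore"), ["vehicle"]),
)
features = preprocess.fit_transform(df)

print(rows_from_2d, rows_from_1d, features.shape)
```

With this data the 2-D form produces a single row while the 1-D form produces four, which is why the original call could not stack the two outputs side by side. If several text columns need vectorizing, one option is to give each its own (CountVectorizer(), 'column_name') entry.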