python pandas scikit-learn one-hot-encoding

Scikit-Learn - one-hot encoding certain columns of a pandas dataframe


I have a dataframe X with integer, float and string columns. I'd like to one-hot encode every column of object dtype, so I'm trying this:

encoding_needed = X.select_dtypes(include='object').columns
ohe = preprocessing.OneHotEncoder()
X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str)) #need astype bc I imputed with 0, so some rows have a mix of zeroes and strings.

However, I end up with IndexError: tuple index out of range. I don't understand this: per the documentation, the encoder expects X: array-like, shape [n_samples, n_features], so passing a dataframe should be fine. How can I one-hot encode only the columns listed in encoding_needed?

EDIT:

The data is confidential, so I cannot share it, and building a dummy version is impractical because it has 123 columns.

I can provide the following:

X.shape: (40755, 123)
encoding_needed.shape: (81,) and is a subset of columns.

Full stack trace:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-90-6b3e9fdb6f91> in <module>()
      1 encoding_needed = X.select_dtypes(include='object').columns
      2 ohe = preprocessing.OneHotEncoder()
----> 3 X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str))

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3365             self._setitem_frame(key, value)
   3366         elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3367             self._setitem_array(key, value)
   3368         else:
   3369             # set column

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in _setitem_array(self, key, value)
   3393                 indexer = self.loc._convert_to_indexer(key, axis=1)
   3394                 self._check_setitem_copy()
-> 3395                 self.loc._setitem_with_indexer((slice(None), indexer), value)
   3396 
   3397     def _setitem_frame(self, key, value):

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    592                     # GH 7551
    593                     value = np.array(value, dtype=object)
--> 594                     if len(labels) != value.shape[1]:
    595                         raise ValueError('Must have equal len keys and value '
    596                                          'when setting with an ndarray')

IndexError: tuple index out of range

Solution
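
A likely cause of the IndexError, sketched on a small made-up frame: by default OneHotEncoder returns a SciPy sparse matrix with one column per category, not one per input column. The traceback shows pandas wrapping the assigned value in np.array(value, dtype=object) and then reading value.shape[1]; NumPy generally treats a sparse matrix as a single opaque object there, so the wrapped array has no second dimension. The frame and variable names below are illustrative assumptions, not the asker's data.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Illustrative data only: two string columns with four categories each
X_demo = pd.DataFrame({'string1': list('abcd'), 'string2': list('efgh')})

encoded = OneHotEncoder().fit_transform(X_demo)

print(sparse.issparse(encoded))  # the default output is a SciPy sparse matrix
print(encoded.shape)             # one column per category: 8 columns from 2 inputs

# Mimic what pandas does in _setitem_with_indexer (see the traceback)
wrapped = np.array(encoded, dtype=object)
print(wrapped.shape)             # expected (), so there is no shape[1] to read

# Densifying gives a regular 2-D array, but still 8 columns rather than 2
dense = encoded.toarray()
print(dense.shape)
```

Either way, the encoded result cannot be assigned back into the original columns one-for-one; it has to replace them, as the approaches below do.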

  • import pandas as pd

    # example data
    X = pd.DataFrame({'int': [0, 1, 2, 3],
                      'float': [4.0, 5.0, 6.0, 7.0],
                      'string1': list('abcd'),
                      'string2': list('efgh')})
    
       int  float string1 string2
    0    0    4.0       a       e
    1    1    5.0       b       f
    2    2    6.0       c       g
    3    3    7.0       d       h
    

    Using pandas

    pandas.get_dummies automatically selects the object columns, drops them, and appends the one-hot-encoded columns in their place:

    pd.get_dummies(X, dtype=int)  # dtype=int gives 0/1; pandas >= 2.0 defaults to bool
    
       int  float  string1_a  string1_b  string1_c  string1_d  string2_e  \
    0    0    4.0          1          0          0          0          1   
    1    1    5.0          0          1          0          0          0   
    2    2    6.0          0          0          1          0          0   
    3    3    7.0          0          0          0          1          0   
    
       string2_f  string2_g  string2_h  
    0          0          0          0  
    1          1          0          0  
    2          0          1          0  
    3          0          0          1  
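
To encode only an explicit list of columns, such as encoding_needed from the question, get_dummies also accepts a columns argument. The sketch below is a hedged illustration on hypothetical data in which 0 stands in for an imputed value, which is why the columns are cast to str first; the names X_str and X_encoded are not from the original post.

```python
import pandas as pd

# Hypothetical data: the string columns contain 0 where a value was imputed
X = pd.DataFrame({'int': [0, 1, 2, 3],
                  'float': [4.0, 5.0, 6.0, 7.0],
                  'string1': ['a', 0, 'c', 'a'],
                  'string2': ['e', 'f', 0, 'e']})

# Mixed int/str columns are stored with object dtype
encoding_needed = X.select_dtypes(include='object').columns

# Cast only those columns to str so 0 becomes the category '0'
X_str = X.astype({col: str for col in encoding_needed})

# Encode exactly the listed columns; the other columns pass through unchanged
X_encoded = pd.get_dummies(X_str, columns=list(encoding_needed), dtype=int)
print(X_encoded.columns.tolist())
```

Passing columns explicitly keeps the selection independent of how pandas infers dtypes, at the cost of having to build the list yourself.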
    

    Using sklearn

    Here we have to specify that we only need the object columns:

    from sklearn.preprocessing import OneHotEncoder
    
    ohe = OneHotEncoder()
    
    X_object = X.select_dtypes('object')
    ohe.fit(X_object)
    
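    codes = ohe.transform(X_object).toarray()
    # get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out
    feature_names = ohe.get_feature_names_out(X_object.columns)
    
    # reuse the original index so the concat aligns row by row
    X = pd.concat([X.select_dtypes(exclude='object'), 
                   pd.DataFrame(codes, columns=feature_names, index=X.index).astype(int)], axis=1)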
    
       int  float  string1_a  string1_b  string1_c  string1_d  string2_e  \
    0    0    4.0          1          0          0          0          1   
    1    1    5.0          0          1          0          0          0   
    2    2    6.0          0          0          1          0          0   
    3    3    7.0          0          0          0          1          0   
    
       string2_f  string2_g  string2_h  
    0          0          0          0  
    1          1          0          0  
    2          0          1          0  
    3          0          0          1