python · scikit-learn · pipeline · grid-search

dictionary hashing memory error and feature hashing float error


Here is my data [as a pandas df]:

print(X_train[numeric_predictors + categorical_predictors].head()):

        bathrooms  bedrooms   price                       building_id  \
10            1.5       3.0  3000.0  53a5b119ba8f7b61d4e010512e0dfc85   
10000         1.0       2.0  5465.0  c5c8a357cba207596b04d1afd1e4f130   
100004        1.0       1.0  2850.0  c3ba40552e2120b0acfc3cb5730bb2aa   
100007        1.0       1.0  3275.0  28d9ad350afeaab8027513a3e52ac8d5   
100013        1.0       4.0  3350.0                                 0  
...
99993         1.0       0.0  3350.0  ad67f6181a49bde19218929b401b31b7   
99994         1.0       2.0  2200.0  5173052db6efc0caaa4d817112a70f32   


                              manager_id  
10      5ba989232d0489da1b5f2c45f6688adc  
10000   7533621a882f71e25173b27e3139d83d  
100004  d9039c43983f6e564b1482b273bd7b01  
100007  1067e078446a7897d2da493d2f741316  
100013  98e13ad4b495b9613cef886d79a6291f  
...
99993   9fd3af5b2d23951e028059e8940a55d7  
99994   d7f57128272bfd82e33a61999b5f4c42  

The last two columns are the categorical predictors.

Similarly, printing the pandas series X_train[target]:

10        medium
10000        low
100004      high
100007       low
100013       low
...
99993        low
99994        low

I am trying to use a pipeline template and am getting errors from the hashing vectorizers.

First, here is my dictionary vectorizer, which gives me a MemoryError:

import pandas as pd

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
dv.fit(feature_dict)
out = pd.DataFrame(
    dv.transform(feature_dict),
    columns = dv.feature_names_
)
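
(An aside, not from the original post: the MemoryError here most likely comes from sparse=False. DictVectorizer creates one binary column per distinct category value, so a dense matrix over the thousands of unique building_id and manager_id strings will not fit in memory. A minimal sketch that keeps the default sparse output instead:)

from sklearn.feature_extraction import DictVectorizer

# sparse=True is the default; the result is a scipy.sparse matrix that is
# never densified, with one 0/1 column per distinct (column, value) pair.
dv = DictVectorizer(sparse=True)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
out_sparse = dv.fit_transform(feature_dict)
print(out_sparse.shape)  # (n_rows, n_distinct_category_values)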

So in the next cell, I use the following code as my feature hashing encoder:

from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=2)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
fh.fit(feature_dict)
out = pd.DataFrame(fh.transform(feature_dict).toarray())
# print(out.head())

The commented-out print line gives me a DataFrame whose rows contain a -1.0, 0.0, or 1.0 float in each of the two cells per row.
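
(An aside: those values are expected rather than an error. FeatureHasher uses a signed hash by default (alternate_sign=True in recent scikit-learn versions), so each of the two category values contributes +1 or -1 to one of the 2 buckets, and colliding entries of opposite sign cancel to 0. A small sketch with made-up IDs:)

from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=2)
# 'abc' and 'xyz' are made-up IDs for illustration. Each (column, value)
# pair hashes to one of the 2 buckets with a +/-1 sign and the signed
# entries are summed, which explains the -1.0 / 0.0 / 1.0 cells
# (a same-sign collision could even produce +/-2.0).
row = [{'building_id': 'abc', 'manager_id': 'xyz'}]
print(fh.transform(row).toarray())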

Here is my vectorizer putting together dictionary & feature hashing:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import FeatureHasher, DictVectorizer

class MyVectorizer(BaseEstimator, TransformerMixin):
    """
    Vectorize a set of categorical variables
    """

    def __init__(self, cols, hashing=None):
        """
        args:
            cols: a list of column names of the categorical variables
            hashing: 
                If None, then vectorization is a simple one-hot-encoding.
                If an integer, then hashing is the number of features in the output.
        """
        self.cols = cols
        self.hashing = hashing

    def fit(self, X, y=None):

        # Choose a vectorizer
        if self.hashing is None:
            self.myvec = DictVectorizer(sparse=False)
        else:
            self.myvec = FeatureHasher(n_features=self.hashing)

        self.myvec.fit(X[self.cols].to_dict(orient='records'))
        return self

    def transform(self, X):

        # Vectorize Input
        if self.hashing is None:
            return pd.DataFrame(
                self.myvec.transform(X[self.cols].to_dict(orient='records')),
                columns = self.myvec.feature_names_
            )
        else:
            return pd.DataFrame(
                self.myvec.transform(X[self.cols].to_dict(orient='records')).toarray()
            )

I put it all together in my pipeline:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('numeric', Pipeline([
            ('scale', StandardScaler())
        ])
        ),
        ('categorical', Pipeline([
            ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
        ])
        )
    ])),
    ('predict', MultinomialNB())
])

and the alpha parameter grid:

alphas = {
    'predict__alpha': [.01, .1, 1, 2, 10]
}

and use GridSearchCV, where I get an error when fitting on the third line here:

from sklearn.model_selection import GridSearchCV

print(X_train.head(), X_train[target])
grid_search = GridSearchCV(pipeline, param_grid=alphas, scoring='accuracy')
grid_search.fit(X_train[numeric_predictors + categorical_predictors], X_train[target])
grid_search.best_params_

ValueError: cannot convert string to float: d7f57128272bfd82e33a61999b5f4c42


Solution

  • The error is due to StandardScaler: you are sending all of your data into it. In the FeatureUnion part of your pipeline, you selected the categorical columns for MyVectorizer but made no such selection for StandardScaler, so every column, including the string-valued categorical ones, goes into it, which causes the error. Also, since each internal pipeline consists of only a single step, there is no need for a Pipeline around them.

    As a first step, change the pipeline to:

    pipeline = Pipeline([
        ('preprocess', FeatureUnion([
            ('scale', StandardScaler()),
            ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
        ])),
        ('predict', MultinomialNB())
    ])
    

    This will still throw the same error, but it's looking much less complex now.

    Now all we need is something that selects only the numeric columns to be given to the StandardScaler, so that the error is not thrown.

    We could do this in many ways, but following your coding style, I will make a new class MyScaler with the required changes:

    class MyScaler(BaseEstimator, TransformerMixin):
    
        def __init__(self, cols):
            self.cols = cols
    
        def fit(self, X, y=None):
    
            self.scaler = StandardScaler()
            self.scaler.fit(X[self.cols])
            return self
    
        def transform(self, X):
            return self.scaler.transform(X[self.cols])
    

    And then change the pipeline to:

    numeric_predictors = ['bathrooms', 'bedrooms', 'price']
    categorical_predictors = ['building_id', 'manager_id']
    
    pipeline = Pipeline([
        ('preprocess', FeatureUnion([
            ('scale', MyScaler(cols=numeric_predictors)),
            ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
        ])),
        ('predict', MultinomialNB())
    ]) 
    

    It still throws an error, because you have given MyVectorizer the literal string 'categorical_predictors' inside a list, not the list of column names. Change it the way I did in MyScaler: change

    MyVectorizer(cols=['categorical_predictors'], hashing=None)
    

    to:

    MyVectorizer(cols=categorical_predictors, hashing=None)
    

    Now your code is syntactically ready to execute. But you have used MultinomialNB() as your predictor, which accepts only non-negative feature values. Since StandardScaler scales the data to zero mean, it will turn some values negative, and your code will again fail. You need to decide how to handle that; one option is to change the scaler to MinMaxScaler, as sketched below.
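
    For instance, here is a minimal sketch of that last swap. The MyMinMaxScaler name is my own, hypothetical variant; the answer above only suggests replacing StandardScaler with MinMaxScaler:

    from sklearn.preprocessing import MinMaxScaler

    class MyMinMaxScaler(BaseEstimator, TransformerMixin):
        """Scale the numeric columns into [0, 1] so that MultinomialNB
        never receives negative feature values."""

        def __init__(self, cols):
            self.cols = cols

        def fit(self, X, y=None):
            self.scaler = MinMaxScaler()
            self.scaler.fit(X[self.cols])
            return self

        def transform(self, X):
            return self.scaler.transform(X[self.cols])

    pipeline = Pipeline([
        ('preprocess', FeatureUnion([
            ('scale', MyMinMaxScaler(cols=numeric_predictors)),
            ('vectorize', MyVectorizer(cols=categorical_predictors, hashing=None))
        ])),
        ('predict', MultinomialNB())
    ])

    One caveat: MinMaxScaler only guarantees the [0, 1] range on the data it was fitted on, so unseen rows with values below the training minimum can still come out negative.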