Here is my data [as a pandas df]:
print(X_train[numeric_predictors + categorical_predictors].head()):

        bathrooms  bedrooms   price                       building_id                        manager_id
10            1.5       3.0  3000.0  53a5b119ba8f7b61d4e010512e0dfc85  5ba989232d0489da1b5f2c45f6688adc
10000         1.0       2.0  5465.0  c5c8a357cba207596b04d1afd1e4f130  7533621a882f71e25173b27e3139d83d
100004        1.0       1.0  2850.0  c3ba40552e2120b0acfc3cb5730bb2aa  d9039c43983f6e564b1482b273bd7b01
100007        1.0       1.0  3275.0  28d9ad350afeaab8027513a3e52ac8d5  1067e078446a7897d2da493d2f741316
100013        1.0       4.0  3350.0                                 0  98e13ad4b495b9613cef886d79a6291f
...           ...       ...     ...                               ...                               ...
99993         1.0       0.0  3350.0  ad67f6181a49bde19218929b401b31b7  9fd3af5b2d23951e028059e8940a55d7
99994         1.0       2.0  2200.0  5173052db6efc0caaa4d817112a70f32  d7f57128272bfd82e33a61999b5f4c42
The last two columns are the categorical predictors.
Similarly, printing the pandas series X_train[target]:
10 medium
10000 low
100004 high
100007 low
100013 low
...
99993 low
99994 low
I am trying to use a pipeline template and get an error with hashing vectorizers.
First, here is my dictionary vectorizer, which gives me a MemoryError:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
dv.fit(feature_dict)
out = pd.DataFrame(
    dv.transform(feature_dict),
    columns=dv.feature_names_
)
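I suspect the MemoryError comes from sparse=False: with thousands of unique building_id and manager_id values, the dense one-hot matrix needs rows × unique-values float64 cells. A minimal sketch of the same encoding that keeps the default sparse output instead (whether my downstream steps would accept a sparse matrix is an assumption I have not verified):

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()  # sparse=True by default; returns a scipy.sparse matrix
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
out = dv.fit_transform(feature_dict)  # sparse matrix, small memory footprint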
So in the next cell, I use the following code as my feature hashing encoder:
from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=2)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
fh.fit(feature_dict)
out = pd.DataFrame(fh.transform(feature_dict).toarray())
# print(out.head())
The commented-out print line gives me a DataFrame whose rows contain a -1.0, 0.0, or 1.0 float in each of the two cells per row.
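Note that with n_features=2 nearly all of the hashed categories collide into the same two buckets, so most of the information in the IDs is destroyed. If I wanted usable hashed features I would pick a much larger power of two (the exact size here is an arbitrary tuning choice of mine, not anything prescribed):

from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=2**10)  # more buckets -> far fewer hash collisions
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
out = pd.DataFrame(fh.fit_transform(feature_dict).toarray())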
Here is my vectorizer, which puts the dictionary and feature-hashing approaches together:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import FeatureHasher, DictVectorizer
class MyVectorizer(BaseEstimator, TransformerMixin):
    """
    Vectorize a set of categorical variables
    """

    def __init__(self, cols, hashing=None):
        """
        args:
            cols: a list of column names of the categorical variables
            hashing:
                If None, then vectorization is a simple one-hot-encoding.
                If an integer, then hashing is the number of features in the output.
        """
        self.cols = cols
        self.hashing = hashing

    def fit(self, X, y=None):
        # Choose a vectorizer
        if self.hashing is None:
            self.myvec = DictVectorizer(sparse=False)
        else:
            self.myvec = FeatureHasher(n_features=self.hashing)
        self.myvec.fit(X[self.cols].to_dict(orient='records'))
        return self

    def transform(self, X):
        # Vectorize input
        if self.hashing is None:
            return pd.DataFrame(
                self.myvec.transform(X[self.cols].to_dict(orient='records')),
                columns=self.myvec.feature_names_
            )
        else:
            return pd.DataFrame(
                self.myvec.transform(X[self.cols].to_dict(orient='records')).toarray()
            )
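For reference, a quick standalone check of the class in hashing mode works for me (the hashing size here is an arbitrary choice of mine):

vec = MyVectorizer(cols=categorical_predictors, hashing=8)
print(vec.fit_transform(X_train).head())  # fit_transform is inherited from TransformerMixin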
I put it all together in my pipeline:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('numeric', Pipeline([
            ('scale', StandardScaler())
        ])),
        ('categorical', Pipeline([
            ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
        ]))
    ])),
    ('predict', MultinomialNB(alphas))
])
and alpha parameters:
alphas = {
    'predict__alpha': [.01, .1, 1, 2, 10]
}
and use GridSearchCV, which gives me an error at the grid_search.fit line here:

from sklearn.model_selection import GridSearchCV

print(X_train.head(), X_train[target])
grid_search = GridSearchCV(pipeline, param_grid=alphas, scoring='accuracy')
grid_search.fit(X_train[numeric_predictors + categorical_predictors], X_train[target])
grid_search.best_params_
ValueError: cannot convert string to float: d7f57128272bfd82e33a61999b5f4c42
The error is due to StandardScaler: you are sending all of your data into it, which is wrong. In the FeatureUnion part of your pipeline, you selected the categorical columns for MyVectorizer but did no such selection for StandardScaler, so all of the columns go into it, and that is what causes the error. Also, since each internal pipeline consists of only a single step, there is no need for the inner Pipeline wrappers.
As a first step, change the pipeline to:
pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('scale', StandardScaler()),
        ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
    ])),
    ('predict', MultinomialNB())
])
This will still throw the same error, but it looks much less complex now.
Now all we need is something that selects the numeric columns to be fed into StandardScaler, so that the error is not thrown. We can do this in many ways, but to follow your coding style I will make a new class, MyScaler, with the required changes:
class MyScaler(BaseEstimator, TransformerMixin):
    """Standard-scale only the specified numeric columns."""

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        self.scaler = StandardScaler()
        self.scaler.fit(X[self.cols])
        return self

    def transform(self, X):
        return self.scaler.transform(X[self.cols])
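As a quick sanity check that only the numeric columns reach the scaler (a sketch of mine, not part of the pipeline itself):

scaler = MyScaler(cols=['bathrooms', 'bedrooms', 'price'])
print(scaler.fit_transform(X_train)[:3])  # three standardized numeric columns, nothing else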
And then change the pipeline to:
numeric_predictors = ['bathrooms', 'bedrooms', 'price']
categorical_predictors = ['building_id', 'manager_id']

pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('scale', MyScaler(cols=numeric_predictors)),
        ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
    ])),
    ('predict', MultinomialNB())
])
It will still throw an error, because you have given 'categorical_predictors' to MyVectorizer as a literal string, not as the list of column names. Change it the way I did in MyScaler, from:

MyVectorizer(cols=['categorical_predictors'], hashing=None)

to:

MyVectorizer(cols=categorical_predictors, hashing=None)
Now your code is syntactically ready to execute. But you have used MultinomialNB as your predictor, and it requires non-negative feature values. Since StandardScaler centers the data to zero mean, it will make some of the values negative, and your code will fail again. You need to decide how to handle that; maybe change the scaler to MinMaxScaler.
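If you go the MinMaxScaler route, the change is a one-line swap inside MyScaler (a sketch; whether squashing your numeric features into [0, 1] is acceptable for your model is your decision):

from sklearn.preprocessing import MinMaxScaler

class MyScaler(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        self.scaler = MinMaxScaler()  # scales to [0, 1], so no negative values reach MultinomialNB
        self.scaler.fit(X[self.cols])
        return self

    def transform(self, X):
        return self.scaler.transform(X[self.cols])

As an aside, newer scikit-learn versions (0.20+) ship ColumnTransformer in sklearn.compose, which does this per-column routing without any custom classes.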