Tags: python, pandas, numpy, scikit-learn

Cannot cast array data from dtype('float64') to dtype('int32') according to 'safe'


I have a dataset with six columns: 'Weight' (float), 'Gender' (int, 0 or 1), 'Height' (float), 'Metabolism' (int, 0 to 3), 'Psychology' (int, 0 to 6), and the target column 'Age' (int). I have to predict 'Age' with sklearn's VotingClassifier. After one-hot encoding the features, I split the data like this:

X_train, X_test, y_train, y_test = train_test_split(X_hot, y, test_size=0.25, random_state=1)

I use these four estimators in the ensemble:

gbm = GradientBoostingRegressor(loss='huber', n_estimators=5000, max_features="sqrt", subsample=0.9)
gbm.fit(X=X_train, y=np.log1p(y_train))

ada = AdaBoostClassifier(n_estimators=2000)
ada.fit(X=X_train, y=y_train)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

I also fit a KNN model (knn_best). Building and fitting the ensemble works without errors:

from sklearn.ensemble import VotingClassifier
estimators=[('knn', knn_best), ('ada', ada), ('log_reg', log_reg), ('gbm', gbm)]
new_ensemble = VotingClassifier(estimators, voting='hard')
new_ensemble.fit(X_train, y_train)

but the next line raises the error:

y_pred = new_ensemble.predict(X_test)

I tried converting X_train, X_test, y_train and y_test to float, but nothing changed. Converting everything to int gives the same error. Why does this line fail? Here is the full traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-86a04c2ceff1> in <module>
----> 1 y_pred = new_ensemble.predict(X_test)

~\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\voting_classifier.py in predict(self, X)
    237                 lambda x: np.argmax(
    238                     np.bincount(x, weights=self._weights_not_none)),
--> 239                 axis=1, arr=predictions)
    240 
    241         maj = self.le_.inverse_transform(maj)

~\Anaconda3\lib\site-packages\numpy\lib\shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    378     except StopIteration:
    379         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 380     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    381 
    382     # build a buffer for storing evaluations of func1d.

~\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\voting_classifier.py in <lambda>(x)
    236             maj = np.apply_along_axis(
    237                 lambda x: np.argmax(
--> 238                     np.bincount(x, weights=self._weights_not_none)),
    239                 axis=1, arr=predictions)
    240 

TypeError: Cannot cast array data from dtype('float64') to dtype('int32') according to the rule 'safe'

Solution

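  • The error comes from gbm, which is a GradientBoostingRegressor. With voting='hard', VotingClassifier counts votes with np.bincount, which needs integer class labels, but the regressor predicts float64 values. Converting X or y does not help, because the floats come from the regressor's predictions, not from your data. Note also that VotingClassifier.fit refits clones of the estimators on y_train, so the earlier fit on np.log1p(y_train) is not used. You can try voting='soft', which averages class probabilities (floats) instead of counting labels, but it requires every estimator to implement predict_proba, which a regressor does not. Either way, gbm should be replaced with a classifier such as GradientBoostingClassifier.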
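
A minimal sketch of the point made in the answer above: it first checks that np.bincount rejects float input, then fits a hard-voting ensemble made only of classifiers. The synthetic data from make_classification, the helper bincount_accepts, and the small n_estimators values are assumptions for illustration; they stand in for the asker's X_hot, y and knn_best, which are not shown in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def bincount_accepts(values):
    """Hypothetical helper: report whether np.bincount accepts the input."""
    try:
        np.bincount(values)
        return True
    except TypeError:
        return False


# Hard voting counts votes with np.bincount, which only takes integers.
int_votes_ok = bincount_accepts(np.array([0, 1, 1]))          # integer labels
float_votes_ok = bincount_accepts(np.array([0.0, 1.3, 1.0]))  # regressor-style output

# Synthetic stand-in data (assumption: the real X_hot and y are unavailable).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Every member is a classifier, so every prediction is an integer class label.
estimators = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=1)),
    ("log_reg", LogisticRegression(max_iter=1000)),
    ("gbm", GradientBoostingClassifier(
        n_estimators=50, max_features="sqrt", subsample=0.9, random_state=1)),
]

# No need to fit the members beforehand: VotingClassifier.fit refits clones.
ensemble = VotingClassifier(estimators, voting="hard")
ensemble.fit(X_train, y_train)
y_pred = ensemble.predict(X_test)
```

Since all four members here implement predict_proba, the same list should also work with voting='soft'. If 'Age' is better treated as a continuous value than as a class label, sklearn's VotingRegressor with regressors only may be the more natural choice.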