I'm trying to apply CatBoost to one of my columns as a categorical feature, but I get the following error:
CatBoostError: Invalid type for cat_feature[non-default value idx=0,feature_idx=2]=68892500.0 : cat_features must be integer or string, real number values and NaN values should be converted to string.
I could use one-hot encoding, but many people here say CatBoost seems to be better at handling this and less prone to overfitting.
My data consists of three columns: 'Country', 'year', and 'phone users'. The target is 'Country'; 'year' and 'phone users' are the features.
Data:
Country year phone users
Ireland 1989 978
France 1990 854
Spain 1991 882
Turkey 1992 457
... ... ...
My code so far:
X = df.loc[115:305]
y = df.loc[80:, 0]
cat_features = list(range(0, X_pool.shape[1]))
Output: [0, 1, 2]
X_train, X_val, y_train, y_val = train_test_split(X_pool, y_pool,
                                                  test_size=0.2, random_state=0)
cbc = CatBoostClassifier(iterations=5, learning_rate=0.1)
cbc.fit(X_train, y_train, eval_set=(X_val, y_val),
        cat_features=cat_features, verbose=False)
print("Model Evaluation Stage")
Do I need to run LabelEncoder before fitting the CatBoost model? What am I missing here?
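As the error message in your question states, categorical features must be integers or strings; floating-point and NaN values have to be converted to strings first. Your 'phone users' column holds floats (68892500.0 in the message), which is why the fit fails. To cast 'phone users' (or any other data frame column) to string, you can use df['phone users'] = df['phone users'].astype(str).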
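As a minimal sketch of that cast, assuming a small made-up data frame that mirrors the columns in the question (the values and the helper name `to_categorical_strings` are hypothetical, not taken from your data):

```python
import pandas as pd

# Hypothetical stand-in for the question's data; 'phone users' is stored as float.
df = pd.DataFrame({
    "Country": ["Ireland", "France", "Spain", "Turkey"],
    "year": [1989, 1990, 1991, 1992],
    "phone users": [978.0, 854.0, 882.0, 457.0],
})

def to_categorical_strings(frame, columns):
    """Return a copy of `frame` in which the given columns hold strings."""
    out = frame.copy()
    for col in columns:
        # Cast every value in the column to its string representation.
        out[col] = out[col].astype(str)
    return out

# Both feature columns are treated as categorical here, as in the question.
df_cat = to_categorical_strings(df, ["year", "phone users"])
print(df_cat.dtypes)
```

The original `df` is left untouched, so you can still use the numeric columns elsewhere if you decide that a feature should not be categorical after all.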
CatBoost will then internally encode each categorical feature using either one-hot encoding or target encoding, depending on the number of unique values it takes. There is no need to encode the categorical features beforehand with LabelEncoder or OneHotEncoder; see the CatBoost documentation for more details.