Tags: python, machine-learning, train-test-split, catboost

Using Catboost Classifier to convert categorical columns


I'm trying to use CatBoost with my columns declared as categorical features, but I get the following error:

CatBoostError: Invalid type for cat_feature[non-default value idx=0,feature_idx=2]=68892500.0 : cat_features must be integer or string, real number values and NaN values should be converted to string.

I could use one-hot encoding, but many people here say CatBoost handles categorical features better and makes the model less prone to overfitting.

My data consists of three columns: 'Country', 'year', and 'phone users'. The target is 'Country'; 'year' and 'phone users' are the features.

Data:

Country   year   phone users
Ireland   1989   978
France    1990   854
Spain     1991   882
Turkey    1992   457
...       ...    ...

My code so far:

from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

X_pool = df.loc[115:305]
y_pool = df.loc[80:, 0]

# Treat every column of X_pool as categorical
cat_features = list(range(0, X_pool.shape[1]))
# cat_features is [0, 1, 2]

X_train, X_val, y_train, y_val = train_test_split(
    X_pool, y_pool, test_size=0.2, random_state=0)

cbc = CatBoostClassifier(iterations=5, learning_rate=0.1)

cbc.fit(X_train, y_train, eval_set=(X_val, y_val),
        cat_features=cat_features, verbose=False)

print("Model Evaluation Stage")

Do I need to run LabelEncoder before fitting the CatBoost model? What am I missing here?


Solution

  • As the error message in your question states, categorical features must be integers or strings; real-number (float) and NaN values have to be converted to strings first. To cast 'phone users' (or any other DataFrame column) to string, use df['phone users'] = df['phone users'].astype(str).

    CatBoost will then internally encode each categorical feature using either one-hot encoding or target encoding depending on the number of unique values that it takes. There is no need to encode the categorical features beforehand using the LabelEncoder or the OneHotEncoder, see the CatBoost documentation for more details.
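The cast described above can be sketched as follows. This is a minimal illustration, not the asker's actual pipeline: the four rows are taken from the table in the question, with 'phone users' stored as floats on the assumption that this is what triggers the error. The helper names cast_categoricals_to_str and fit_catboost are hypothetical, introduced only for this example. The CatBoost import sits inside the function so that the casting part runs even where catboost is not installed.

```python
import pandas as pd


def cast_categoricals_to_str(frame, columns):
    """Return a copy of `frame` with the given columns cast to str."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].astype(str)
    return out


def fit_catboost(X, y, cat_features):
    """Fit a small CatBoostClassifier (hypothetical helper; needs catboost)."""
    from catboost import CatBoostClassifier  # imported lazily on purpose

    model = CatBoostClassifier(iterations=5, learning_rate=0.1, verbose=False)
    model.fit(X, y, cat_features=cat_features)
    return model


# Rows from the question's table; 'phone users' as float is an assumption.
df = pd.DataFrame({
    "Country": ["Ireland", "France", "Spain", "Turkey"],
    "year": [1989, 1990, 1991, 1992],
    "phone users": [978.0, 854.0, 882.0, 457.0],
})

# Float values are rejected as categorical, so cast that column to str.
X = cast_categoricals_to_str(df[["year", "phone users"]], ["phone users"])
y = df["Country"]

# With real data (more than one row per class), the model could be fit as:
# model = fit_catboost(X, y, cat_features=["year", "phone users"])
```

Integer columns such as 'year' can be passed as categorical without casting. If 'phone users' is really a numeric quantity, another option is to leave it out of cat_features so that CatBoost treats it as a number.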