I have the following text data for a classifier.
I want to predict 2 values in my data: country, sport. Example: 1) USA | basketball; 2) UK | football
Currently I'm using CatBoostClassifier() to predict a single value (e.g. country):
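from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# ngram_range is expected to be a tuple (min_n, max_n)
vectorizer = CountVectorizer(ngram_range=(1, 2))
x = vectorizer.fit_transform(df['words']).toarray()
y = df['country'].astype(int)
grid = GridSearchCV(CatBoostClassifier(n_estimators=200, silent=False), cv=3,
                    param_grid={'learning_rate': [0.03], 'max_depth': [3]})
grid.fit(x, y)
model = grid.best_estimator_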
Can I use the classifier to predict 2 or more values and get a combined model?
You can use the sklearn.multioutput module, which also works with CatBoostClassifier. All the classifiers provided by this module take a base estimator for a single output and extend it to a multioutput estimator. For example, you can use MultiOutputClassifier this way:
from catboost import CatBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
clf = MultiOutputClassifier(CatBoostClassifier(n_estimators=200, silent=False))
Since this is a scikit-learn estimator, you can also use it in a grid search as before, prefixing the base estimator's parameters with estimator__:
grid = GridSearchCV(clf, param_grid={'estimator__learning_rate': [0.03], 'estimator__max_depth': [3]}, cv=3)
grid.fit(x, y)
The labels you use to train the model should be a 2D array with one row per sample and one column per target, in this format:
import numpy as np
y = np.asarray([['USA', 'basketball'], ['UK', 'football']])
No changes to your features x are needed.
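To see what the combined model looks like end to end, here is a minimal sketch. The toy DataFrame is made up for illustration, and LogisticRegression is used as a lightweight stand-in for the base estimator so the snippet runs without CatBoost; swapping in CatBoostClassifier(n_estimators=200) should work the same way, since MultiOutputClassifier simply fits one copy of the base estimator per target column.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy data; the column names mirror the ones in the question.
df = pd.DataFrame({
    "words": [
        "nba dunk hoop lakers",
        "premier league goal striker",
        "hoop rebound nba playoffs",
        "goal penalty league derby",
    ],
    "country": ["USA", "UK", "USA", "UK"],
    "sport": ["basketball", "football", "basketball", "football"],
})

# Features are built exactly as for the single-target case.
x = CountVectorizer(ngram_range=(1, 2)).fit_transform(df["words"])

# Labels: one row per sample, one column per target.
y = df[["country", "sport"]].to_numpy()

# Stand-in base estimator (assumption); replace with CatBoostClassifier as needed.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(x, y)

pred = clf.predict(x)
print(pred.shape)            # one column per target: (4, 2)
print(len(clf.estimators_))  # one fitted base estimator per target: 2
```

Each row of pred holds the predicted country and sport for one sample, in the same column order as y. Note that the two targets are modeled independently, so any correlation between country and sport is not exploited by this wrapper.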