python nlp catboost

CatBoostClassifier for multiple parameters


I have the following text data for a classifier:

  1. He is an American basketball player
  2. He played in football in UK.

I want to predict two values for each text: country and sport. For example: 1) USA | basketball; 2) UK | football.

Currently I'm using CatBoostClassifier() to predict a single value (e.g. country):

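from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Bag-of-words features over unigrams and bigrams (ngram_range must be a tuple)
vectorizer = CountVectorizer(ngram_range=(1, 2))
x = vectorizer.fit_transform(df['words']).toarray()
y = df['country'].astype(int)  # single target (country), cast to int
grid = GridSearchCV(CatBoostClassifier(n_estimators=200, silent=False), cv=3,
                param_grid={'learning_rate': [0.03], 'max_depth': [3]})
grid.fit(x, y)
model = grid.best_estimator_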

Can I use the classifier to predict two or more values and get a single combined model?


Solution

  • You can use the sklearn.multioutput module, which also works with CatBoostClassifier. Each classifier in this module takes a single-output base estimator and extends it to a multi-output estimator. MultiOutputClassifier, for example, fits one copy of the base estimator per target column. You can use it this way:

    from catboost import CatBoostClassifier
    from sklearn.multioutput import MultiOutputClassifier
    
    clf = MultiOutputClassifier(CatBoostClassifier(n_estimators=200, silent=False))
    

    Since this is a scikit-learn estimator, you can also use it in a grid search as before. The parameter names need the estimator__ prefix so that they reach the wrapped CatBoostClassifier:

    from sklearn.model_selection import GridSearchCV

    grid = GridSearchCV(clf, param_grid={'estimator__learning_rate': [0.03], 'estimator__max_depth': [3]}, cv=3)
    grid.fit(x, y)
    

    The labels used to train the model should be a 2D array with one row per sample and one column per target (here: country, sport):

    import numpy as np
    
    y = np.asarray([['USA', 'basketball'], ['UK', 'football']])
    

    No changes to your features x are needed.
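To make the steps above concrete, here is a minimal end-to-end sketch of fitting the wrapper and reading back both predictions. It rests on a few assumptions that are not in the original post: the four toy sentences and their labels are made up for illustration, and a DecisionTreeClassifier is used as a lightweight stand-in for the base estimator so the example runs quickly on tiny data. Passing CatBoostClassifier(n_estimators=200) in its place should work the same way.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data, made up for illustration: one text per row, two labels per row.
texts = [
    "He is an American basketball player",
    "He played in football in UK.",
    "An American basketball coach",
    "She plays football in the UK",
]
y = np.asarray([
    ["USA", "basketball"],
    ["UK", "football"],
    ["USA", "basketball"],
    ["UK", "football"],
])

# Same feature extraction as in the question.
vectorizer = CountVectorizer(ngram_range=(1, 2))
x = vectorizer.fit_transform(texts)

# Stand-in base estimator (assumption); a CatBoostClassifier could be passed here instead.
clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(x, y)

# The wrapper keeps one fitted copy of the base estimator per target column.
n_models = len(clf.estimators_)

# predict returns an array of shape (n_samples, n_outputs), columns in the same order as y.
pred = clf.predict(vectorizer.transform(["He is an American basketball player"]))
country, sport = pred[0]
print(n_models, country, sport)
```

Since each target gets its own independent model, the wrapper does not learn any relationship between country and sport. If that dependency matters, sklearn.multioutput.ClassifierChain may be worth a look.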