
XGBoost only predicting a single class out of 18 for unseen data in a multiclass text classification problem


In the current situation, the XGBoost model that I have trained predicts only a single class for the unseen dataset, even though the accuracy I get on the validation set is around 64%, which is not bad for my use case.

In my current use case, I am trying to predict a target class for each piece of text. There are 18 classes and the total dataset size is just over 1,000 rows (very small), but my primary question is why XGBoost predicts only a single class.

I am using the following code to achieve this:

 # TF-IDF vector representation of the text
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.model_selection import GridSearchCV
 from xgboost import XGBClassifier

 tfidf_vect = TfidfVectorizer(analyzer='word',
                              stop_words=stopwords_custom,
                              max_features=total_features,
                              lowercase=True)
 fitted_vectorizer = tfidf_vect.fit(X_train)
 xtrain_tfidf = fitted_vectorizer.transform(X_train)
 xval_tfidf = fitted_vectorizer.transform(X_val)

 # Running the XGB model with a grid search over key hyperparameters
 xgb_params = {'max_depth': [3, 5, 7],
               'n_estimators': [50, 100, 150],
               'reg_alpha': [0.1, 0.5, 1],
               'reg_lambda': [1, 1.5, 2],
               'min_child_weight': [2, 4, 6]}
 xgb_clf = XGBClassifier()
 grid_xgb = GridSearchCV(estimator=xgb_clf,
                         param_grid=xgb_params,
                         cv=5,
                         n_jobs=-1)
 grid_xgb.fit(xtrain_tfidf, y_train)

 print(grid_xgb.best_params_)
 print(grid_xgb.best_score_)
 # Training the final model with the best parameters found by the grid search
 from sklearn import metrics

 final_xgb = XGBClassifier(max_depth=5,
                           reg_alpha=1,
                           reg_lambda=1,
                           n_estimators=100,
                           objective='multi:softmax',
                           num_class=18,
                           random_state=42)
 final_xgb.fit(xtrain_tfidf, y_train)
 final_xgb_predict = final_xgb.predict(xval_tfidf)
 xgb_accuracy = metrics.accuracy_score(y_val, final_xgb_predict)
 print("XGBoost > Accuracy:", xgb_accuracy)

Where am I going wrong?


Solution

  • Where am I going wrong?

    TL;DR: you are training and making predictions on sparse data matrices, but you should be using dense ones. fitted_vectorizer.transform(X) returns a SciPy sparse matrix; convert it to dense with its .todense() (or .toarray()) method and see if the situation improves, as shown in the sketch below.

    XGBoost interprets an empty cell in a sparse matrix as a missing value, rather than as a 0 count.

    If you replace XGBClassifier with a Scikit-Learn classifier (e.g. GradientBoostingClassifier), then your existing code works as expected, because Scikit-Learn interprets empty cells differently, as 0 counts.
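
    A minimal sketch of the suggested fix, reusing the variable names from the question (xtrain_tfidf, xval_tfidf, final_xgb); .toarray() is used here as the ndarray-returning equivalent of .todense():

     # Densify the TF-IDF matrices so that XGBoost sees explicit zeros
     # instead of missing values
     xtrain_dense = xtrain_tfidf.toarray()
     xval_dense = xval_tfidf.toarray()

     final_xgb.fit(xtrain_dense, y_train)
     final_xgb_predict = final_xgb.predict(xval_dense)

    At roughly 1,000 rows this conversion is cheap, but for much larger corpora the dense matrix may not fit in memory. In that case, the Scikit-Learn alternative mentioned above avoids the conversion entirely, since it can consume the sparse matrix directly; a sketch with hyperparameters left at their defaults:

     # Scikit-Learn's gradient boosting treats empty cells in a sparse
     # matrix as 0 counts, so no densification is needed
     from sklearn.ensemble import GradientBoostingClassifier

     gb_clf = GradientBoostingClassifier(random_state=42)
     gb_clf.fit(xtrain_tfidf, y_train)
     gb_pred = gb_clf.predict(xval_tfidf)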