The XGBoost model that I have trained is predicting only a single class for the unseen dataset, even though the accuracy I got on the validation set is around 64%, which is not bad for my use case.
In my current use case, I am trying to predict a target class for each piece of text. There are 18 classes, and the total dataset is just over 1,000 rows (very small), but my primary question is why XGBoost is producing only a single class.
I am using the following code to achieve this:
#tf-idf vector representation
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer='word',
                             stop_words=stopwords_custom,
                             max_features=total_features,
                             lowercase=True)
fitted_vectorizer = tfidf_vect.fit(X_train)
xtrain_tfidf = fitted_vectorizer.transform(X_train)
xval_tfidf = fitted_vectorizer.transform(X_val)
#Running the XGB model
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

xgb_params = {'max_depth': [3, 5, 7],
              'n_estimators': [50, 100, 150],
              'reg_alpha': [0.1, 0.5, 1],
              'reg_lambda': [1, 1.5, 2],
              'min_child_weight': [2, 4, 6]}
xgb_clf = XGBClassifier()
grid_xgb = GridSearchCV(estimator=xgb_clf,
                        param_grid=xgb_params,
                        cv=5,
                        n_jobs=-1)
grid_xgb.fit(xtrain_tfidf, y_train)
print(grid_xgb.best_params_)
print(grid_xgb.best_score_)
#Training the model with best params
from sklearn import metrics

final_xgb = XGBClassifier(max_depth=5,
                          reg_alpha=1,
                          reg_lambda=1,
                          n_estimators=100,
                          objective='multi:softmax',
                          num_class=18,
                          random_state=42)
final_xgb.fit(xtrain_tfidf, y_train)
final_xgb_predict = final_xgb.predict(xval_tfidf)
xgb_accuracy = metrics.accuracy_score(y_val, final_xgb_predict)
print("XGBoost > Accuracy: ", xgb_accuracy)
Where am I going wrong?
TLDR: you are training and making predictions on sparse data matrices, but you should be using dense data matrices. Convert your fitted_vectorizer.transform(X) results to dense using the sparse matrix's todense() (or toarray()) method, and see if the situation improves.
XGBoost interprets an empty cell as a missing value, rather than a 0 count.
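A minimal sketch of what "densify before XGBoost" means, on a toy corpus (the corpus here is made up for illustration, not your data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["spam spam ham", "ham eggs", "spam eggs"]
vect = TfidfVectorizer(analyzer='word', lowercase=True)
X_sparse = vect.fit_transform(corpus)   # scipy.sparse CSR matrix

# The sparse matrix stores only the non-zero entries; XGBoost treats
# the unstored cells as missing values. Densifying materializes every
# zero explicitly, so they are seen as 0 counts:
X_dense = X_sparse.toarray()

print(X_sparse.nnz)    # stored (non-zero) entries only
print(X_dense.size)    # all cells, zeros included
```

You would then pass X_dense (and the densified validation matrix) to XGBClassifier.fit() and .predict() instead of the sparse versions.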
If you replace XGBClassifier with a Scikit-Learn classifier (e.g. GradientBoostingClassifier), then your existing code would work as expected, because Scikit-Learn interprets empty cells differently, as 0 counts.
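To illustrate the contrast, a sketch with GradientBoostingClassifier fed the sparse TF-IDF matrix directly, no densification needed (the corpus and labels are again made-up placeholders):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["spam spam ham", "ham eggs", "spam eggs", "ham ham", "eggs eggs"]
labels = [0, 1, 0, 1, 1]

# The TF-IDF output stays sparse; scikit-learn estimators treat the
# unstored cells as 0 counts, so no .toarray() call is required.
X = TfidfVectorizer(analyzer='word', lowercase=True).fit_transform(corpus)

clf = GradientBoostingClassifier(n_estimators=10, random_state=42)
clf.fit(X, labels)
preds = clf.predict(X)
```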