python · nlp · text-classification · naive-bayes · countvectorizer

How to reduce the number of features in text classification?


I'm doing dialect text classification using CountVectorizer with naive Bayes. The number of features is too large: I collected 20k tweets across 4 dialects, 5,000 tweets per dialect, and ended up with 43k features. I suspect this is causing overfitting, because the accuracy drops a lot when I test on new data. How can I limit the number of features to avoid overfitting?


Solution

  • You can set the max_features parameter of CountVectorizer to, say, 5000; that caps the vocabulary at the 5000 most frequent terms and may help with overfitting. You could also tinker with max_df (for instance, set it to 0.95 so that terms appearing in more than 95% of the tweets are dropped).
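
    As a minimal sketch of the above (the tweet texts and dialect labels here are placeholders standing in for your data):

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder data; in practice these would be your 20k tweets and dialect labels.
    texts = ["example tweet one", "another tweet here", "more text data", "final tweet sample"]
    labels = ["dialect_a", "dialect_b", "dialect_a", "dialect_b"]

    vectorizer = CountVectorizer(
        max_features=5000,  # keep only the 5000 most frequent terms
        max_df=0.95,        # drop terms appearing in more than 95% of documents
        min_df=2,           # optionally also drop terms seen in fewer than 2 documents
    )
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(texts, labels)

    # Vocabulary size is now bounded by max_features.
    print(len(vectorizer.vocabulary_))
    ```

    min_df is another knob worth trying: very rare terms (often typos or one-off hashtags in tweets) add features without adding signal.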