python · nlp · text-classification · naive-bayes · countvectorizer

How to reduce the number of features in text classification?


I'm doing dialect text classification using CountVectorizer with naive Bayes. The number of features is too large: I collected 20k tweets across 4 dialects, 5,000 tweets per dialect, and ended up with 43k features. I suspect this is causing overfitting, because the accuracy drops a lot when I test on new data. How can I limit the number of features to avoid overfitting?


Solution

  • You can set the max_features parameter of CountVectorizer to, say, 5000; that caps the vocabulary at the 5000 most frequent terms and may help with overfitting. You could also tinker with max_df (for instance, set it to 0.95 so that terms appearing in more than 95% of the tweets are dropped).
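
    As a minimal sketch of the above (the tweet texts and dialect labels here are placeholders standing in for your data):

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Placeholder data; in practice these would be your 20k tweets and dialect labels.
    texts = ["example tweet one", "another tweet here", "more text data", "final tweet sample"]
    labels = ["dialect_a", "dialect_b", "dialect_a", "dialect_b"]

    vectorizer = CountVectorizer(
        max_features=5000,  # keep only the 5000 most frequent terms
        max_df=0.95,        # drop terms appearing in more than 95% of documents
        min_df=2,           # optionally also drop terms seen in fewer than 2 documents
    )
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(texts, labels)

    # Vocabulary size is now bounded by max_features.
    print(len(vectorizer.vocabulary_))
    ```

    min_df is another knob worth trying: very rare terms (often typos or one-off hashtags in tweets) add features without adding signal.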