Tags: pandas, machine-learning, scikit-learn, nltk, naivebayes

Should I keep the proportion of categories when executing a stratification?


I have 30,000 phrases categorized by sentiment.

I'm gonna use Naive Bayes.

Here's the proportion (sentiment -> number of phrases).

anger           98
boredom        157
empty          659
enthusiasm     522
fun           1088
happiness     2986
hate          1187
love          2068
neutral       6340
relief        1021
sadness       4828
surprise      1613
worry         7433

So, I have to split my dataset into train/test to execute my model, etc, right?

Should I keep the proportion of the categories when executing the stratification?

I mean, if I pick 30% for the test sample, should I keep 30% of each sentiment instead of 30% of the whole dataset?

I guess yes, but I would like to have a more experienced opinion.

And how would you do that? Does anyone here know a better way than running a Python loop that checks each sentiment, calculates 30%, puts the result in a dictionary, etc.?

Is there any Pandas trick to stratify by a category feature, keeping the proportion?


Solution

  • Should I keep the proportion of the categories when executing the stratification?

    You seem a little confused regarding the terminology; stratification (or stratified sampling) by definition maintains the proportions of the categories. Without that, it is just simple random sampling.

  • If I pick 30% for the test sample, should I keep 30% of each sentiment instead of 30% of the whole dataset?

    The two are not contradictory, are they? If you keep 30% of each category, you end up with 30% of your initial set (up to rounding).
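To make that arithmetic concrete, here is a small sketch using the counts from the question. The names `counts`, `test_frac` and `per_class_test` are illustrative only, and per-class sizes are rounded to whole phrases, so the total can differ from an exact 30% by a few phrases.

```python
# Sentiment counts copied from the question
counts = {
    "anger": 98, "boredom": 157, "empty": 659, "enthusiasm": 522,
    "fun": 1088, "happiness": 2986, "hate": 1187, "love": 2068,
    "neutral": 6340, "relief": 1021, "sadness": 4828,
    "surprise": 1613, "worry": 7433,
}
test_frac = 0.3  # illustrative name for the test fraction

total = sum(counts.values())

# Take 30% of each sentiment, rounded to whole phrases
per_class_test = {s: round(n * test_frac) for s, n in counts.items()}
test_total = sum(per_class_test.values())

# The overall test share is (almost exactly) the same 30%
print(total, test_total, test_total / total)
```

Taking 30% per sentiment and taking 30% overall give the same test size; the difference is only that the per-sentiment version also guarantees that rare classes such as anger are represented in both splits.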

  • Is there any Pandas trick to stratify by a category feature, keeping the proportion?

    I don't know about pandas, but scikit-learn (which I guess you are going to use next) has model_selection.train_test_split, which includes a stratify option for exactly this:

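    from sklearn.model_selection import train_test_split

    # X holds the phrases (or their features), y the sentiment labels.
    # stratify=y keeps each sentiment's share the same in train and test.
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        stratify=y,
                                                        test_size=0.3)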
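As a quick sanity check, the sketch below runs the same call on made-up data and counts the labels in each split. The toy labels, the placeholder phrases and the `random_state=0` seed are assumptions for illustration, not part of the original question.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, imbalanced labels standing in for the real sentiments (made up)
y = ["worry"] * 70 + ["neutral"] * 20 + ["anger"] * 10
X = [f"phrase {i}" for i in range(len(y))]  # placeholder phrases

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Count how many phrases of each sentiment landed in each split
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts)
print(test_counts)
```

With 70/20/10 labels and a 30% test set, the test split should hold 21 worry, 6 neutral and 3 anger phrases, i.e. the same 70/20/10 proportions as the full set.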