python machine-learning classification sentiment-analysis text-classification

Classification: Tweet Sentiment Analysis - Order of steps

I am currently working on a tweet sentiment analysis and have a few questions regarding the right order of the steps. Please assume that the data was already preprocessed and prepared accordingly. So this is how I would proceed:

use train_test_split (80:20 ratio) to withhold a test data set.
vectorize x_train since the tweets are not numerical.
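A minimal sketch of these two steps, assuming scikit-learn's TfidfVectorizer as the vectorizer (the question does not say which one is used). The tweets and labels are made-up toy data; train_tf and y_train are the names the grid search further down expects.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed tweets and their labels (hypothetical data).
tweets = [
    "love this phone", "great service today", "what a wonderful day",
    "really happy with it", "best purchase ever",
    "hate this phone", "terrible service today", "what an awful day",
    "really unhappy with it", "worst purchase ever",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80:20 split; stratify keeps the class ratio the same in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=1
)

# Learn the vocabulary from the training tweets only,
# then reuse that same vocabulary to transform the test tweets.
vectorizer = TfidfVectorizer()
train_tf = vectorizer.fit_transform(x_train)
test_tf = vectorizer.transform(x_test)

print(train_tf.shape, test_tf.shape)
```

Fitting the vectorizer on x_train only (and calling transform, not fit_transform, on x_test) keeps information from the test tweets out of the features.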

In the next steps, I would like to identify the best classifier. Please assume those were already imported. So I would go on by:

hyperparameter tuning (grid search) combined with cross-validation. In this step, I would like to identify the best parameters for each classifier. For KNN the code is as follows:

from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# train_tf: vectorized training tweets, y_train: training labels (from the steps above)
model = KNeighborsClassifier()
n_neighbors = range(1, 10, 2)
weights = ['uniform', 'distance']
# note: 'minkowski' with the default p=2 is the same distance as 'euclidean'
metric = ['euclidean', 'manhattan', 'minkowski']

# define grid search
grid = dict(n_neighbors=n_neighbors, weights=weights, metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy', error_score=0)
grid_result = grid_search.fit(train_tf, y_train)

# summarize results: best combination first, then every combination tried
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

compare the accuracy (depending on the best hyperparameters) of the classifiers
choose the best classifier
take the withheld test data set (from train_test_split()) and use the best classifier on the test data
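The last three steps could be sketched roughly as follows. Several things here are assumptions of the sketch, not part of the question: the toy corpus, LogisticRegression as the second candidate, the small parameter grids, and wrapping the vectorizer in a Pipeline so that it is refit inside every cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 40 short "tweets", 20 positive (1) and 20 negative (0).
positive = ["love", "great", "happy", "wonderful", "best"]
negative = ["hate", "terrible", "sad", "awful", "worst"]
topics = ["phone", "service", "movie", "weather"]
tweets = [f"{word} {topic}" for word in positive + negative for topic in topics]
labels = [1] * 20 + [0] * 20

x_train, x_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, stratify=labels, random_state=1
)

# One (estimator, grid) pair per candidate; the "clf__" prefix targets the pipeline step.
candidates = {
    "knn": (KNeighborsClassifier(), {"clf__n_neighbors": [1, 3, 5]}),
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
}

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
searches = {}
for name, (estimator, grid) in candidates.items():
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", estimator)])
    search = GridSearchCV(pipeline, param_grid=grid, cv=cv, scoring="accuracy")
    searches[name] = search.fit(x_train, y_train)
    print(name, round(search.best_score_, 3), search.best_params_)

# Choose the classifier with the best cross-validated accuracy ...
best_name = max(searches, key=lambda n: searches[n].best_score_)
# ... and touch the withheld test set exactly once, at the very end.
test_accuracy = searches[best_name].best_estimator_.score(x_test, y_test)
print("chosen:", best_name, "test accuracy:", round(test_accuracy, 3))
```

By default GridSearchCV refits the best parameter combination on all of x_train, so best_estimator_ is ready to score the test set.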

Is this the right approach, or would you recommend changing something (e.g. running the cross-validation separately rather than inside the hyperparameter tuning)? Does it make sense to evaluate on the test data as the final step, or should I do it earlier to assess the accuracy on unseen data?

Solution

There are lots of ways to do this. People have strong opinions about it, and I'm not always convinced they fully understand what they advocate.

TL;DR: Your methodology looks great and you're asking sensible questions.

Having said that, here are some things to consider:

Why are you doing train-test split validation?
Why are you doing hyperparameter tuning?
Why are you doing cross-validation?

Yes, each of these techniques is good at something specific, but that doesn't necessarily mean they should all be part of the same pipeline.

First off, let's answer these questions:

Train-Test Split is useful for testing your classifier's inference abilities. In other words, we want to know how well a classifier performs in general (not on the data we used for training). The test portion allows us to evaluate our classifier without using our training portion.
Hyperparameter-Tuning is useful for evaluating the effect of hyperparameters on the performance of a classifier. For it to be meaningful, we must compare two (or more) models that use different hyperparameters but are preferably trained on the same training portion (to eliminate selection bias). What do we do once we know the best performing hyperparameters? Will this set of hyperparameters always perform optimally? No. Because of the randomness involved in sampling and training, one set of hyperparameters may work best in experiment A while another set works best in experiment B. Rather, hyperparameter tuning is good for generalizing about which hyperparameters to use when building a classifier.
Cross-validation is used to smooth out some of the randomness associated with building classifiers. A machine learning pipeline may produce a classifier that is 94% accurate on one test fold and 83% accurate on another. What does that mean? It might mean that one fold contains samples that are easy. Or it might mean that the classifier, for whatever reason, is actually better. You don't know because it's a black box.
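To see the fold-to-fold spread described in the last point, here is a small sketch. It assumes synthetic data from make_classification as a stand-in for the vectorized tweets and a default KNN classifier; the exact numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized tweets (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv, scoring="accuracy")

# Same model, same settings; only the held-out fold changes.
for fold, score in enumerate(scores, start=1):
    print(f"fold {fold}: {score:.3f}")
print(f"min {scores.min():.3f}, max {scores.max():.3f}, "
      f"mean {scores.mean():.3f}, std {scores.std():.3f}")
```

The gap between the minimum and maximum fold score gives a feel for how much a single train-test split could mislead in either direction.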

Practically, how is this helpful?

I see little value in using both a train-test split and cross-validation. I use cross-validation and report accuracy as an average over the n folds. That already tests my classifier's performance, so I don't see how dividing your training data further to do another round of train-test validation is going to help. Use the average. Having said that, I use the best performing of the n models created during cross-validation as my final model. As I said, it's a black box, so we can't know which model is really best but, all else being equal, you may as well use the best performing one. It might actually be better.
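One way the workflow in the previous paragraph might look with scikit-learn, assuming synthetic data and a KNN classifier as placeholders: cross_validate with return_estimator=True keeps the model fitted in each fold, so the average can be reported and the best-scoring fold model kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized tweets (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = cross_validate(
    KNeighborsClassifier(n_neighbors=5), X, y,
    cv=cv, scoring="accuracy", return_estimator=True,
)

# Report the average accuracy over the folds ...
mean_accuracy = results["test_score"].mean()
print(f"mean accuracy over {cv.get_n_splits()} folds: {mean_accuracy:.3f}")

# ... and keep the model from the fold that scored highest.
best_fold = int(np.argmax(results["test_score"]))
final_model = results["estimator"][best_fold]
print(f"keeping the model from fold {best_fold + 1}")
```

Note that each fold model has seen only (n-1)/n of the data. Refitting the chosen configuration on all of the data is a common alternative, which is what GridSearchCV does by default.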

Hyperparameter-tuning is useful, but extensive tuning can take forever. I suggest adding hyperparameter tuning to your pipeline but only testing 2 sets of hyperparameters at a time. So, keep all your hyperparameters constant except one, e.g. batch size = {64, 128}. Run that, and you'll be able to say with confidence, "Oh, that made a big difference: 64 works better than 128!" or "Well, that was a waste of time. It didn't make much difference either way." If the difference is small, ignore that hyperparameter and try another pair. This way, you'll slowly tack towards optimal without all the wasted time.
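A sketch of this one-at-a-time comparison, using KNN's n_neighbors in place of batch size since KNN is the classifier from the question. The synthetic data and the two values (3 and 7) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized tweets (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# A fixed seed means both settings are scored on exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for k in (3, 7):
    # Everything except n_neighbors stays constant.
    model = KNeighborsClassifier(n_neighbors=k, weights="uniform")
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[k] = (scores.mean(), scores.std())
    print(f"n_neighbors={k}: {scores.mean():.3f} (+/- {scores.std():.3f})")

difference = results[3][0] - results[7][0]
print(f"difference in mean accuracy: {difference:+.3f}")
```

If the difference is small compared with the fold-to-fold standard deviation, that is the "didn't make much difference either way" case.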

In practice, I'd say leave the extensive hyperparameter-tuning to academics and take a more pragmatic approach.

But yeah, your methodology looks good as it is. I think you're thinking about what you're doing, and that already puts you a step ahead of the pack.