stratify argument in train_test_split vs StratifiedShuffleSplit

What is the difference between using the stratify argument of sklearn's train_test_split function and using the StratifiedShuffleSplit class? Don't they do the same thing?

Solution

These two are used for different purposes: train_test_split is a function, while StratifiedShuffleSplit is a cross-validator class.

train_test_split, as its name implies, splits the data into a single training subset and a single test subset; the stratify argument makes that one split stratified, so class proportions are preserved in both subsets.
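As a minimal sketch of that single stratified split (the toy arrays, the 80/20 class imbalance, and the test_size and random_state values are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# One call produces exactly one train subset and one test subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Per-class counts in each subset; the 80/20 ratio should carry over.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Without stratify=y, the class ratio in the test subset would depend on the random shuffle and could drift from 80/20.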

StratifiedShuffleSplit, on the other hand, provides splits for cross-validation; from the docs:

Stratified ShuffleSplit cross-validator

Provides train/test indices to split data in train/test sets.

Notice the plural "sets" in that description.

So, StratifiedShuffleSplit is there to be used instead of KFold when we want to ensure the CV splits are stratified, and not to replace train_test_split.
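To illustrate the contrast, here is a hedged sketch of StratifiedShuffleSplit producing several stratified train/test index pairs from the same data (again, the toy arrays and the n_splits, test_size, and random_state values are assumptions chosen for the example):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Same hypothetical toy data: 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# The splitter yields n_splits independent (train_idx, test_idx) pairs.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
splits = list(sss.split(X, y))

for i, (train_idx, test_idx) in enumerate(splits):
    # Each pair holds index arrays, not the data itself.
    print(i, np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```

The splitter object can also be passed as the cv argument of helpers such as cross_val_score or GridSearchCV, which is the typical way to use it in place of KFold.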
