I have read some posts on various cross-validation (CV) approaches, but I don't understand why shuffling the data in the CV function leads to a significant increase in accuracy, or when it is correct to do so.
My time series dataset has size 921 * 10080, where each row is a time series of water temperature at a particular location in an area and the 2 last columns are the labels with 2 groups, i.e. high risk (high bacteria level in water) and low risk (low bacteria level in water). Accuracy varies widely depending on whether I set "shuffle=True" (accuracy of around 75%) or "shuffle=False" (accuracy of 50%) in StratifiedKFold, as shown below:
from sklearn.model_selection import StratifiedKFold

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)
The sklearn documentation states the following:
A note on shuffling
If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.
Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:
• This consumes less memory than shuffling the data directly.
• By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
• The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
• To get identical results for each split, set random_state to an integer.
I am not sure if I interpret the documentation correctly, so an explanation is much appreciated. Besides, I have a few questions:
1) Why is there such a huge improvement in accuracy after shuffling? Am I overfitting? When should I shuffle?
2) Given that all samples are collected from the same area, they are probably not independent. How does this affect shuffling? Is it still valid to shuffle?
3) Does shuffling separate the labels from their corresponding X data? (Answer update: No. Shuffling does not separate labels from their corresponding X data.)
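The answer to question 3 can be verified with a short sketch. It assumes made-up data in which column 0 of X stores each row's own label, purely so that the pairing can be checked after splitting; StratifiedKFold only returns row indices, and the same indices are applied to both X and y.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = np.array([0, 1] * 10)
# Hypothetical features: column 0 carries the row's own label, column 1 is noise.
X = np.column_stack([y, rng.normal(size=20)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pairing_intact = True
for train_idx, test_idx in skf.split(X, y):
    # The same index array selects rows of X and entries of y.
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    pairing_intact = pairing_intact and np.array_equal(X_train[:, 0], y_train)
    pairing_intact = pairing_intact and np.array_equal(X_test[:, 0], y_test)

print("labels still match their rows:", pairing_intact)
```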
Your question is quite tricky and probably it is better placed here.
In my time series dataset of size 921 * 10080 where each row is a time series of water temperature of a particular location in an area and the last column being the label with 2 groups
Aren't you treating this as a classification problem with time-series features? You are using dependent variables (the time series of the water temperature) to predict a label. To me this sounds risky, and I would assume there is little chance of predicting the label well. Here is one scenario to think about:
Location  Time1  Time2  Time3  Label
A             3      2      1      1
B           100     99     98      1
C            98     99    100      0
So in this example, label 1 is a time series that goes down and label 0 is a time series that goes up, but I would bet every classifier has trouble learning this without connecting the trend component across your columns.
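One possible way to make the trend explicit, sketched here on the three rows from the table above, is to compute a slope per row and use it as a feature. The least-squares slope and the threshold at zero are assumptions chosen for this toy example, not a recommendation for the real dataset.

```python
import numpy as np

# Rows A, B, C from the table above; columns are Time1..Time3.
series = np.array([[3, 2, 1],
                   [100, 99, 98],
                   [98, 99, 100]], dtype=float)
labels = np.array([1, 1, 0])

t = np.arange(series.shape[1])
# np.polyfit fits one line per column of its y argument, so pass series.T;
# the first row of the result holds the slopes.
slopes = np.polyfit(t, series.T, deg=1)[0]

# Toy rule (assumed): a falling series gets label 1, a rising one label 0.
predicted = (slopes < 0).astype(int)

print("slopes:   ", slopes)
print("predicted:", predicted)
print("labels:   ", labels)
```

The raw values of A and B are far apart while B and C are close, so a classifier looking at the columns independently sees B as similar to C; the slope feature groups A with B instead.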
To come back to your question, this can help you to understand shuffling: difference between StratifiedKFold and StratifiedShuffleSplit in sklearn