I'm sure this is possible, but I haven't been able to figure it out. Given a training dataset, using TimeSeriesSplit with n_splits=5, the splits look like this:
[0] : [1]
[0 1] : [2]
[0 1 2] : [3]
[0 1 2 3] : [4]
[0 1 2 3 4] : [5]
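For reference, here is a minimal sketch that reproduces the listing above. The six-sample toy array `X` is an assumption made only for illustration; with real data each index would be a row of the training set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six toy samples (hypothetical data, only used to show the indices).
X = np.arange(6).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)

# Collect each (train, test) pair of index arrays as plain lists.
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    print(train, ":", test)
```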
The problem is that for the first couple of passes, the TfidfVectorizer is working with a very small vocabulary (few features), and I would like to fit it on the entire training set before splitting so that the feature size stays the same for all splits.
Barring that, does anybody know of a way, while using TimeSeriesSplit, to pass only the last two splits in the series? So instead of all 5 splits, GridSearchCV just uses these two:
[0 1 2 3] : [4]
[0 1 2 3 4] : [5]
This would allow a much better vectorization fit. Even though the vocabulary won't be identical between passes, at least the vectorizer has a larger portion of the data to work with before validation.
Thanks.
EDIT:
The pipeline I'm using is essentially a TfidfVectorizer followed by a classifier. But after inspecting the data and features, it looks like the dataset is being split before being fed to TfidfVectorizer(). Here are the broad strokes:
tscv = TimeSeriesSplit(n_splits=5)
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('rfc', RandomForestClassifier())])
grid = GridSearchCV(pipe, params, cv=tscv, scoring='roc_auc')
This seems to do what I want. I didn't realize you could simply pass cv an iterable. All you have to do is create a time series split, or whatever splits you want, and pass an iterable of (train, test) index pairs. So if you have a 10-item dataset and you want just the last two time series splits of n_splits=4, you would pass this to cv:
cv = [([0, 1, 2, 3, 4, 5], [6, 7]),
      ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])]
In this way you can pass any iterable of (train indices, test indices) tuples you want.
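Rather than typing the indices by hand, the same list can probably be built by slicing the output of TimeSeriesSplit. This is only a sketch: the ten-sample toy array `X` is an assumption for illustration, and in practice it would be the real training data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten toy samples (hypothetical data standing in for the training set).
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)

# Materialize all four splits, then keep only the last two.
last_two = list(tscv.split(X))[-2:]

# Convert the index arrays to plain lists so they are easy to inspect.
cv = [(train.tolist(), test.tolist()) for train, test in last_two]
print(cv)
```

The resulting `cv` list can then be passed to the grid search in place of the splitter, e.g. `GridSearchCV(pipe, params, cv=cv, scoring='roc_auc')`, with `pipe` and `params` defined as in the question.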