Tags: python, scikit-learn, tf-idf, tfidfvectorizer, gridsearchcv

sklearn pipeline: running TfidfVectorizer on full training set before applying TimeSeriesSplit inside GridSearchCV?


I'm sure this is possible, but I haven't been able to figure it out. Given a training dataset and a TimeSeriesSplit with n_splits=5, the splits look like this:

[0] : [1]
[0 1] : [2]
[0 1 2] : [3]
[0 1 2 3] : [4]
[0 1 2 3 4] : [5]
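
For reference, the pattern above can be reproduced with a minimal sketch. It assumes six placeholder samples (the array `X` is made up for illustration), so that each block in the listing corresponds to a single sample:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six placeholder samples, so each block in the listing is one sample.
X = np.arange(6).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

# Print each split as "train indices : test indices".
for train, test in splits:
    print(train, ":", test)
```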

The problem is that for the first couple of passes, the TfidfVectorizer only sees a small amount of vocabulary/features. I would like to fit it on the entire training set before splitting, so that the feature size stays the same for all splits.

Barring that, does anybody know of a way to use TimeSeriesSplit but pass only the last two splits in the series? So instead of all 5 splits, GridSearchCV would use just these two:

[0 1 2 3] : [4]
[0 1 2 3 4] : [5]

This would allow a much better vectorizer fit. The vocabulary still won't be identical between passes, but at least each pass has a larger portion of the data to work with before validation.

Thanks.

EDIT:

The pipeline I'm using is essentially a TfidfVectorizer followed by a classifier. After inspecting the data and features, though, it looks like the data set is being split up before being fed to the TfidfVectorizer(). Here are the broad strokes:

tscv = TimeSeriesSplit(n_splits=5)
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('rfc', RandomForestClassifier())])
grid = GridSearchCV(pipe, params, cv=tscv, scoring='roc_auc')
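
For the first option (a constant feature size across splits), one possible approach is to learn the vocabulary once on the full training set and hand it to the pipeline's vectorizer through its `vocabulary` parameter. This is only a sketch: `train_docs` is a hypothetical toy corpus, and the IDF weights are still re-estimated on each training fold, so only the set of columns is fixed. Note that the vocabulary then includes terms from later time periods.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus in time order; replace with the real training texts.
train_docs = [
    "rates rise again",
    "markets fall on rate fears",
    "earnings beat estimates",
    "rates hold steady",
    "markets rally after earnings",
    "estimates cut on weak demand",
]

# Learn the vocabulary once from the full training set.
full_vocab = TfidfVectorizer().fit(train_docs).vocabulary_

# A vectorizer with a fixed vocabulary always emits len(full_vocab) columns;
# only the IDF weights are re-estimated each time it is fitted.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(vocabulary=full_vocab)),
    ("rfc", RandomForestClassifier()),
])

# Fitting on just the first two documents still yields the full feature width.
first_fold = pipe.named_steps["tfidf"].fit_transform(train_docs[:2])
print(first_fold.shape, len(full_vocab))
```

A pipeline built this way can be passed to GridSearchCV exactly like the one above.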

Solution

  • This seems to do what I want. I didn't realize you could simply pass cv an iterable of (train, test) index pairs. All you have to do is create a time series split, or whatever splits you want, and pass an iterable containing the indices. So if you have a 10-item dataset and want only the last two time series splits of n_splits=4, you would pass this to cv:

    cv = [([0, 1, 2, 3, 4, 5], [6, 7]),
          ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])]
    

    In this way you can pass any sequence of (train indices, test indices) pairs you want.
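
Rather than typing the indices by hand, the same list can probably be generated from TimeSeriesSplit itself by keeping only the last two splits. A minimal sketch, assuming a placeholder 10-item dataset `X` (made up for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder 10-item dataset; only its length matters for the split indices.
X = np.arange(10).reshape(-1, 1)

# Materialize all four splits, then keep only the last two.
all_splits = list(TimeSeriesSplit(n_splits=4).split(X))
cv = [(train.tolist(), test.tolist()) for train, test in all_splits[-2:]]
print(cv)
```

The resulting `cv` list can then be passed straight to the grid search, e.g. `GridSearchCV(pipe, params, cv=cv, scoring='roc_auc')`.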