
How to apply Kfold with TfidfVectorizer?


I'm having an issue applying K-fold cross-validation with Tfidf. It gives me this error:

ValueError: setting an array element with a sequence.

I have seen other questions about the same error, but they were using train_test_split(). It's a little different with K-fold:

for train_fold, valid_fold in kf.split(reviews_p1):
    vec = TfidfVectorizer(ngram_range=(1,1))
    reviews_p1 = vec.fit_transform(reviews_p1)

    train_x = [reviews_p1[i] for i in train_fold]        # Extract train data with train indices
    train_y = [labels_p1[i] for i in train_fold]        # Extract train data with train indices

    valid_x = [reviews_p1[i] for i in valid_fold]        # Extract valid data with cv indices
    valid_y = [labels_p1[i] for i in valid_fold]        # Extract valid data with cv indices

    svc = LinearSVC()
    model = svc.fit(X = train_x, y = train_y) # We fit the model with the fold train data
    y_pred = model.predict(valid_x)

I found where the problem is, but I can't find a way to fix it. When we extract the train data with the train/validation indices, we get a list of sparse matrices instead of a single sparse matrix:

[<1x21185 sparse matrix of type '<class 'numpy.float64'>'
    with 54 stored elements in Compressed Sparse Row format>,
 <1x21185 sparse matrix of type '<class 'numpy.float64'>'
    with 47 stored elements in Compressed Sparse Row format>,
 <1x21185 sparse matrix of type '<class 'numpy.float64'>'
    with 18 stored elements in Compressed Sparse Row format>, ....]

I tried applying Tfidf to the data after splitting, but it didn't work because the number of features wasn't the same in the train and validation folds.

So is there any way to split the data for K-fold without creating a list of sparse matrices?


Solution

  • In an answer to a similar question, "Do I use the same Tfidf vocabulary in k-fold cross_validation", they suggest splitting the raw data first and fitting the vectorizer inside each fold:

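    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.svm import SVC

    # data_x (raw texts) and data_y (labels) must support selection by an
    # index array, e.g. NumPy arrays; kf is the KFold splitter.
    for train_index, test_index in kf.split(data_x, data_y):
        # Split the raw texts first, before any vectorization
        x_train, x_test = data_x[train_index], data_x[test_index]
        y_train, y_test = data_y[train_index], data_y[test_index]

        # Fit the vocabulary on the training fold only, then reuse it
        # so both folds end up with the same number of features
        tfidf = TfidfVectorizer()
        x_train = tfidf.fit_transform(x_train)
        x_test = tfidf.transform(x_test)

        clf = SVC()
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        score = accuracy_score(y_test, y_pred)
        print(score)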
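
For reference, here is a minimal, self-contained sketch of the same idea applied to the setup from the question. The toy corpus, the labels, and the KFold settings (n_splits=4, shuffle=True, random_state=0) are made up for illustration and stand in for reviews_p1 and labels_p1. It also shows that indexing an already vectorized sparse matrix with the index array itself, rather than with a list comprehension, returns one sparse matrix; fitting Tfidf on all the data first does leak validation vocabulary into training, though, so the per-fold fit is probably the safer choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Toy data, invented for this sketch (stands in for reviews_p1 / labels_p1).
reviews = np.array([
    "great movie loved it",
    "wonderful acting and a great story",
    "loved the story and the cast",
    "a wonderful and moving film",
    "terrible film waste of time",
    "boring plot and terrible acting",
    "waste of money very boring",
    "awful movie bad story",
])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

kf = KFold(n_splits=4, shuffle=True, random_state=0)
scores = []

for train_fold, valid_fold in kf.split(reviews):
    # Index the raw texts with the fold indices (NumPy fancy indexing).
    train_text, valid_text = reviews[train_fold], reviews[valid_fold]
    train_y, valid_y = labels[train_fold], labels[valid_fold]

    # Fit the vocabulary on the training fold only; transform() reuses it,
    # so train_x and valid_x have the same number of columns.
    vec = TfidfVectorizer(ngram_range=(1, 1))
    train_x = vec.fit_transform(train_text)   # one sparse matrix
    valid_x = vec.transform(valid_text)       # one sparse matrix

    model = LinearSVC().fit(train_x, train_y)
    y_pred = model.predict(valid_x)
    scores.append(accuracy_score(valid_y, y_pred))

print(scores)

# Alternative to the list comprehension in the question: a CSR matrix can be
# row-indexed with the index array directly and stays a single sparse matrix.
# (Fitting on all reviews up front leaks validation vocabulary into training.)
full = TfidfVectorizer().fit_transform(reviews)
sub = full[train_fold]
print(sub.shape)
```

Because the vectorizer is refit in every fold, the number of features can differ from fold to fold, but within a fold the train and validation matrices always match.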