I'm having an issue applying K-fold cross-validation with Tfidf. It gives me this error:
ValueError: setting an array element with a sequence.
I have seen other questions about the same error, but they were using train_test_split(). It's a little different with K-fold:
for train_fold, valid_fold in kf.split(reviews_p1):
    vec = TfidfVectorizer(ngram_range=(1,1))
    reviews_p1 = vec.fit_transform(reviews_p1)
    train_x = [reviews_p1[i] for i in train_fold] # Extract train data with train indices
    train_y = [labels_p1[i] for i in train_fold] # Extract train labels with train indices
    valid_x = [reviews_p1[i] for i in valid_fold] # Extract valid data with cv indices
    valid_y = [labels_p1[i] for i in valid_fold] # Extract valid labels with cv indices
    svc = LinearSVC()
    model = svc.fit(X = train_x, y = train_y) # We fit the model with the fold train data
    y_pred = model.predict(valid_x)
I actually found where the problem is, but I can't find a way to fix it. Basically, when we extract the train data with the cv/train indices, we get a list of sparse matrices:
[<1x21185 sparse matrix of type '<class 'numpy.float64'>'
with 54 stored elements in Compressed Sparse Row format>,
<1x21185 sparse matrix of type '<class 'numpy.float64'>'
with 47 stored elements in Compressed Sparse Row format>,
<1x21185 sparse matrix of type '<class 'numpy.float64'>'
with 18 stored elements in Compressed Sparse Row format>, ....]
I tried applying Tfidf to the data after splitting, but that didn't work because the train and validation sets ended up with different numbers of features.
So is there any way to split the data for K-fold without creating a list of sparse matrices?
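For what it's worth, here is a minimal sketch of one way to avoid the list: a SciPy CSR matrix can be indexed directly with the index arrays that kf.split() yields, which returns a single sparse matrix rather than a list of 1-row matrices. The toy `reviews` and `labels` below are hypothetical stand-ins for reviews_p1 and labels_p1, and the KFold settings are assumptions. Note that this sketch fits the vectorizer once on all the data, so the vocabulary and IDF weights are learned from the validation rows too.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for reviews_p1 / labels_p1.
reviews = ["good movie", "bad movie", "great film", "awful film",
           "good film", "bad plot", "great plot", "awful movie"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Vectorize into a separate variable so the raw texts are not overwritten.
X = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(reviews)

kf = KFold(n_splits=4, shuffle=True, random_state=0)  # assumed settings
fold_shapes = []
for train_fold, valid_fold in kf.split(X):
    # Indexing the CSR matrix with an index array gives ONE sparse matrix,
    # not a Python list of 1 x n_features matrices.
    train_x, valid_x = X[train_fold], X[valid_fold]
    train_y, valid_y = labels[train_fold], labels[valid_fold]

    model = LinearSVC().fit(train_x, train_y)
    y_pred = model.predict(valid_x)
    fold_shapes.append((train_x.shape, valid_x.shape))
```

Keeping the labels in a NumPy array lets them be indexed the same way as the matrix.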
In an answer to a similar question, "Do I use the same Tfidf vocabulary in k-fold cross_validation", they suggest:
for train_index, test_index in kf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)
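The same fit-on-train, transform-on-test idea can probably be written more compactly by wrapping the vectorizer and classifier in a scikit-learn pipeline and letting cross_val_score do the splitting, so the Tfidf vocabulary is refit on each training fold. This is only a sketch: the toy `data_x` and `data_y` are hypothetical, LinearSVC is swapped in for SVC to match the code in the question, and the KFold settings are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; in the question this would be the raw review texts.
data_x = ["good movie", "bad movie", "great film", "awful film",
          "good film", "bad plot", "great plot", "awful movie"]
data_y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# The pipeline refits TfidfVectorizer on each training fold and only
# transforms the held-out fold, so feature counts always match.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())

kf = KFold(n_splits=4, shuffle=True, random_state=0)  # assumed settings
scores = cross_val_score(pipeline, data_x, data_y, cv=kf, scoring="accuracy")
print(scores.mean())
```

One thing to watch in the manual loop above: data_x[train_index] only works if data_x is a NumPy array or similar, not a plain Python list, whereas cross_val_score accepts a list of strings.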