I have a DataFrame with a text column and multilabel values:
RepID  RepText                                   Code
1      This is a test. thanks for purchasing...  Fruit, Meat
2      Purchased Milk, and Bananas, I also p...  Dairy, Fruit, Others
Here is my code
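from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# df has 1000 records
multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(df['Code'])
y = multilabel_binarizer.transform(df['Code'])
X = df[df.columns.difference(["Code"])]
# df is now split into X (RepID, RepText) and y (Code)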
xtrain, xval, ytrain, yval = train_test_split(X, y, test_size=0.2, random_state=9)
# xtrain.shape = (800, 3)
# xval.shape = (200, 3)
# ytrain.shape = (800, 1725)
# yval.shape = (200, 1725)
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)
# But after the code above:
# xtrain_tfidf.shape = (3, 3)
# xval_tfidf.shape = (3, 3)
# ytrain.shape = (800, 1725)
# yval.shape = (200, 1725)
# which means that when I run the next lines
xval_tfidf.shape
#mdl = LinearRegression()
mdl = LogisticRegression()
#mdl = SVC(gamma='auto', probability=True)
clf = OneVsRestClassifier(mdl)
clf.fit(xtrain_tfidf, ytrain)
I get this error
ValueError: Found input variables with inconsistent numbers of samples: [3, 799]
Why am I getting only 3 records instead of 800 after the TfidfVectorizer lines?
When I inspected xtrain_tfidf, I got this:
xtrain_tfidf
Out[56]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
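I found the reason: I forgot to select only the text column when splitting the records.
X was a whole DataFrame, and iterating over a DataFrame yields its column names rather than its rows, so TfidfVectorizer treated the 3 column names as 3 documents. Splitting on the text column alone fixes it: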
xtrain, xval, ytrain, yval = train_test_split(X["RepText"], y, test_size=0.2, random_state=9)
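To make the cause visible, here is a minimal sketch that reproduces the effect on a toy frame. The data and the "Extra" column are made up for illustration (they stand in for whatever third column X contained), so treat it as a demonstration under those assumptions, not the original dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: 4 rows, 3 columns ("Extra" is an invented placeholder column).
X = pd.DataFrame({
    "RepID": [1, 2, 3, 4],
    "RepText": [
        "thanks for purchasing fruit and meat",
        "purchased milk and bananas",
        "bought bread and cheese",
        "ordered apples and yogurt",
    ],
    "Extra": ["a", "b", "c", "d"],
})

# Iterating a DataFrame yields its column labels, not its rows.
docs_seen = list(X)

# Passing the whole DataFrame: the vectorizer sees one "document" per column name.
wrong = TfidfVectorizer().fit_transform(X)

# Passing only the text column: one document per row, as intended.
right = TfidfVectorizer().fit_transform(X["RepText"])

print(docs_seen)
print(wrong.shape)
print(right.shape)
```

With the whole frame, the row count of the TF-IDF matrix equals the number of columns, which is where the 3 in the error message comes from. With X["RepText"], the row count matches the number of records, so it lines up with y again.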