python scikit-learn nlp decision-tree text-classification

NLP text classification CountVectorizer Shape Error


I have a text dataset with one column for reviews and another for labels. I want to build a decision tree model on it, but after vectorizing the text I get this error: ValueError: Number of labels=37500 does not match number of samples=1. vect.vocabulary_ returns {'review': 0}, where review is the column name, so I think the vectorizer is not being fitted on all of the data. Here is the code; any help is appreciated.

from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(data.iloc[:,:-1],data.iloc[:,-1:],
test_size = 0.25, random_state = 42)

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

from sklearn.tree import DecisionTreeClassifier 
DTC = DecisionTreeClassifier()
DTC.fit(X_train_dtm, y_train)
y1_pred_class = DTC.predict(X_test_dtm)

Also, X_train_dtm.shape reports a 1x1 matrix: <bound method spmatrix.get_shape of <1x1 sparse matrix of type '<class 'numpy.int64'>' with 1 stored elements in Compressed Sparse Row format>>


Solution

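  • It worked when I passed single columns to train_test_split instead of DataFrame slices:

    X_train, X_test, y_train, y_test = train_test_split(data['text'], data['tag'], test_size=0.25, random_state=42)

    data.iloc[:,:-1] returns a DataFrame, and iterating over a DataFrame yields its column names rather than its rows, so CountVectorizer saw a single document (the column name). Selecting one column gives a Series of strings, which is iterated review by review.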
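
For illustration, here is a minimal sketch that reproduces the 1x1 matrix and then applies the fix. The toy DataFrame and its column names ("review", "label") are assumptions made up for this example, not the original dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; column names "review" and "label" are assumptions.
data = pd.DataFrame({
    "review": [
        "great movie loved it",
        "terrible plot and bad acting",
        "loved the acting",
        "bad movie",
        "great plot",
        "terrible and boring",
        "loved it great fun",
        "boring bad film",
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Problem: a DataFrame (even with one column) iterates over column names,
# so the vectorizer is fitted on the single string "review".
bad_vect = CountVectorizer()
bad_dtm = bad_vect.fit_transform(data[["review"]])
print(bad_vect.vocabulary_)  # {'review': 0}
print(bad_dtm.shape)         # (1, 1)

# Fix: pass Series (single columns) so each review is one document.
X_train, X_test, y_train, y_test = train_test_split(
    data["review"], data["label"], test_size=0.25, random_state=42
)

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn vocabulary on training text only
X_test_dtm = vect.transform(X_test)        # reuse the same vocabulary for test text

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train_dtm, y_train)
y_pred = clf.predict(X_test_dtm)
print(X_train_dtm.shape[0], len(y_train))  # row count now matches label count
```

If the original iloc-based split is preferred, data.iloc[:, 0] (a single integer position instead of a slice) should also return a Series, assuming the review text is the first column.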