python nlp vectorization tfidfvectorizer data-preprocessing

ValueError: Index length mismatch: 4064 vs. 1

I am working on an NLP problem (https://www.kaggle.com/c/nlp-getting-started). I want to vectorize the text after train_test_split, but when I do, the resulting sparse matrix has only 1 row, which cannot be right.

My train_x has shape (4064, 1), yet after tfidf.fit_transform I get a matrix with a single row. How can that be? Below is my code:

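import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # assumed: ps is a Porter stemmer (not shown in the original post)
stop_words = set(stopwords.words('english'))  # build the set once, not once per word

def clean_text(text):
    tokens = nltk.word_tokenize(text)    # tokenize the text
    lower = [word.lower() for word in tokens]  # convert words to lowercase
    remove_stopwords = [word for word in lower if word not in stop_words]
    remove_char = [word for word in remove_stopwords if word.isalpha()]  # keep alphabetic tokens only
    stemmed = [ps.stem(word) for word in remove_char]     # stem the words (stemming, not lemmatization)
    cleaned_data = " ".join(stemmed)
    return cleaned_data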

x['clean_text']= x["text"].map(clean_text)

x.drop(['text'], axis = 1, inplace = True)

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=69, stratify=y)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tfidf = TfidfVectorizer()
train_x_vect = tfidf.fit_transform(train_x)
test_x1 = tfidf.transform(test_x)

pd.DataFrame.sparse.from_spmatrix(train_x_vect,
                              index=train_x.index,
                              columns=tfidf.get_feature_names())

When I try to convert the sparse matrix (which has only 1 row) into a DataFrame, I get the error in the title.

train_x has 4064 rows and my sparse matrix has 1 row, which is why I get the error. Any help will be appreciated!

Solution

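You are getting the error because TfidfVectorizer expects an iterable of strings, one string per document. You can check this in the documentation. Iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer sees a single document (the column name) and returns a matrix with one row.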
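To see where the single row comes from, here is a minimal sketch on a toy one-column DataFrame. The data and the `docs_seen` name are made up for illustration; only the column name `clean_text` comes from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for train_x: a one-column DataFrame of cleaned text (made-up data).
train_x = pd.DataFrame(
    {"clean_text": ["forest fire near town", "flood warning issued", "sunny day today"]}
)

# Iterating a DataFrame yields its column labels, not its rows.
docs_seen = list(train_x)
print(docs_seen)

# The vectorizer iterates its input the same way, so it sees one "document".
matrix = TfidfVectorizer().fit_transform(train_x)
print(matrix.shape[0])  # number of rows equals the number of columns in train_x
```

With three rows of text, the matrix still has one row, because the only thing the vectorizer iterated over was the column label.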

Here you are passing a DataFrame as the input, hence the odd output. First convert the text column to a list of strings using:

train_x = train_x['column_name'].to_list()

and then pass it to the vectorizer. There are several ways to convert a DataFrame to a list, but they do not all produce the same structure. For example, if you convert the DataFrame using:

train_x = train_x.values.tolist()

you get a list of lists (one inner list per row) rather than a flat list of strings, and TfidfVectorizer cannot work with that: it expects each element to be a string, and it typically fails with an AttributeError when it tries to lowercase a list. Selecting the column and calling to_list() is the conversion I found to work with the vectorizer.

Another thing to keep in mind is that the vectorizer handles a single text column only. If your DataFrame has more than one column and you convert the whole thing to a list and pass it to the vectorizer, it will throw an error, because each element is then a list of values rather than a single string. If you need several text columns, join them into one string per row first, or vectorize each column separately.
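If you do have several text columns, one possible workaround is to concatenate them into a single string per row before vectorizing. This is only a sketch: the `keyword` column and the sample values are hypothetical, and whether merging columns suits your model is a separate question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-column frame; both columns hold text.
df = pd.DataFrame(
    {
        "keyword": ["fire", "flood"],
        "clean_text": ["forest fire near town", "flood warning issued"],
    }
)

# Join the columns row by row so each document is a single string.
combined = (df["keyword"] + " " + df["clean_text"]).to_list()

matrix = TfidfVectorizer().fit_transform(combined)
print(matrix.shape[0])  # one row per original DataFrame row
```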