python dataframe machine-learning text-classification countvectorizer

How to use text classification with a dataframe in Python


I'm using text classification to classify dialects. However, I noticed that I have to use CountVectorizer like so:

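from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
from sklearn.feature_extraction.text import CountVectorizer

# X is an iterable of tweet strings; keep at most the 200 most frequent terms
vectorizer = CountVectorizer(max_features=200, min_df=2, max_df=0.7, stop_words=stopwords.words('arabic'))
X = vectorizer.fit_transform(X).toarray()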

What happens is that I have to make a new text file for every line in my CSV file. I have collected 1000 labeled tweets from Twitter, and they are all in a single CSV file.

I have 2 questions:

  1. Do I have to do this, i.e. put every line in its own text file? Or can I use the data as a dataframe?
  2. Do I have to use CountVectorizer for text classification? Is there another way?

Solution

    1. No, you don't have to put every line in a separate text file. The example in the official sklearn documentation, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html , shows how to do it. To follow that example, convert your CSV column of tweets from the dataframe to a list and pass it to fit_transform the same way the documentation example does.

    2. No, you don't have to use CountVectorizer. There are several other methods of converting text to vectors for classification, such as TF-IDF and Word2Vec (CountVectorizer itself produces a bag-of-words representation). For your case, I believe TF-IDF or Word2Vec will work fine.
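
Both points can be sketched together in a few lines. This is only a minimal illustration, not the one correct way to do it: the column names "tweet" and "dialect", the tiny inline dataframe (standing in for something like pd.read_csv on your file), and the choice of LogisticRegression as the classifier are all assumptions made for the example.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the labeled CSV; column names are assumed.
df = pd.DataFrame({
    "tweet": ["good morning friends", "good night friends",
              "morning coffee time", "night coffee time"],
    "dialect": ["A", "A", "B", "B"],
})

# 1. A dataframe column is an iterable of strings, so it can be passed
#    to the vectorizer as-is; no per-line text files are needed.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(df["tweet"])  # sparse matrix, one row per tweet

# 2. TF-IDF as an alternative to raw counts, chained with a classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(df["tweet"], df["dialect"])
predicted = model.predict(pd.Series(["good morning coffee"]))
```

With real data you would also hold out a test split before fitting, and drop or fill any empty rows in the tweet column, since the vectorizers expect every entry to be a string.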