Tags: text, machine-learning, scikit-learn, text-classification

For text classification with scikit-learn, do I have to use both CountVectorizer and TF-IDF?


The example code in the scikit-learn documentation applies CountVectorizer first and then TF-IDF on top of it. Can I use only TF-IDF? http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


If I use only TF-IDF and pass my preprocessed texts as input, it does not accept the data type (I tried both a list and a NumPy array). Can someone help?


Solution

    1. The example in the tutorial puts a TfidfTransformer on top of CountVectorizer. Using TfidfVectorizer directly produces the same result when the parameters match, so you can choose whichever of the two you prefer.
    2. I don't fully understand your question. scikit-learn vectorizers accept several kinds of input, from lists or arrays of strings to filenames and file objects (see the input= argument). To build the n-grams, they use the tokenizer= and preprocessor= arguments. What exactly is the issue you are running into?
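
To make point 1 concrete, here is a minimal sketch comparing the two routes. The corpus `docs` is a made-up toy example, and all vectorizers use their default parameters. One thing that may explain the data type error in the question: TfidfTransformer expects a matrix of token counts (the output of CountVectorizer), not raw strings, whereas TfidfVectorizer accepts the raw strings directly.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus: a plain list of already preprocessed strings.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Two-step route: raw token counts first, then TF-IDF weighting
# applied to the resulting count matrix.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer takes the raw strings directly.
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two sparse matrices element by element.
same = np.allclose(two_step.toarray(), one_step.toarray())
print("Same TF-IDF matrix:", same)
```

If you change a parameter on one route (for example `ngram_range` or `sublinear_tf`), set the same value on the other route as well, otherwise the matrices will differ.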