Tags: machine-learning, nlp, text-classification, countvectorizer

How to use bigrams + trigrams + a word-marks vocabulary in CountVectorizer?


I'm doing text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of:

bigrams + trigrams + word-marks vocabulary 

By word-marks, the author means words that are specific to a certain dialect.

How can I set the CountVectorizer parameters to get this combination?

[Image: examples of word marks]

Those are examples of word marks, but they are not the ones I have, because mine are Arabic. Here are mine, translated:

word_marks=['love', 'funny', 'happy', 'amazing']

Those are used to classify a text.

Also, in this post: Understanding the `ngram_range` argument in a CountVectorizer in sklearn

there was this answer:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

I couldn't understand the output. What does [1, 1] mean here? And how was he able to use ngram_range together with vocabulary? Aren't the two mutually exclusive?


Solution

  • Use the ngram_range argument to select bigrams and trigrams. For bigrams and trigrams only, that is CountVectorizer(ngram_range=(2, 3)); CountVectorizer(ngram_range=(1, 3)) would also include unigrams.

    See the accepted answer to this question for more details.

    Please provide an example of "word-marks" for the other part of your question.

    You may have to run CountVectorizer twice: once for the n-grams and once for your custom word-mark vocabulary. You can then concatenate the two outputs column-wise to get a single feature set of n-gram counts and custom vocabulary counts. Both outputs are sparse matrices, so scipy.sparse.hstack is the natural way to join them. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.

    Here's a SO answer on concatenating arrays
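
The two-vectorizer approach described above could look roughly like the following. This is a minimal sketch, not the paper's method: the two-sentence corpus `docs` is invented for illustration, the word marks are the translated ones from the question, and the default tokenizer is assumed (it drops one-character tokens such as "a" and "i"). The last lines revisit the example from the question: with a fixed vocabulary, ngram_range only controls which n-grams are extracted from the text, and each output column counts one vocabulary entry, so [1, 1] means "keeps" occurred once and "keeps the" occurred once.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "what a funny and amazing day",
    "i love this happy song",
]
# Word marks taken from the question (translated examples).
word_marks = ["love", "funny", "happy", "amazing"]

# Vectorizer 1: learns every bigram and trigram present in the corpus.
ngram_vec = CountVectorizer(ngram_range=(2, 3))
X_ngrams = ngram_vec.fit_transform(docs)

# Vectorizer 2: counts only the fixed word-mark vocabulary.
# Passing a list keeps the columns in the order given.
marks_vec = CountVectorizer(vocabulary=word_marks)
X_marks = marks_vec.fit_transform(docs)

# Both outputs are SciPy sparse matrices, so join them column-wise.
X = hstack([X_ngrams, X_marks]).tocsr()

# The example from the question: ngram_range and vocabulary combined.
# Unigrams and bigrams are extracted, but only those listed in the
# vocabulary get a column, one count per entry.
v = CountVectorizer(ngram_range=(1, 2), vocabulary=["keeps", "keeps the"])
counts = v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
```

The combined matrix X can be passed to a classifier such as MultinomialNB. At prediction time, new text would need to go through the same two fitted vectorizers (using transform, not fit_transform) and be stacked in the same order.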