I am using TfidfVectorizer with the following parameters:
smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word', ngram_range=(1,2)
I am vectorizing the following text: "red sun, pink candy. Green flower."
Here is the output of get_feature_names():
['candy', 'candy green', 'coffee', 'flower', 'green', 'green flower', 'hate', 'icecream', 'like', 'moon', 'pink', 'pink candy', 'red', 'red sun', 'sun', 'sun pink']
Since "candy" and "green" belong to separate sentences, why is the "candy green" n-gram created?
Is there a way to prevent the creation of n-grams spanning multiple sentences?
It depends on how you are passing the text to TfidfVectorizer.
If passed as a single document, TfidfVectorizer with its default token_pattern lowercases the text and keeps only tokens of 2 or more alphanumeric characters. Punctuation is completely ignored and always treated as a token separator, so the sentence boundary is lost. Your text becomes:
['red', 'sun', 'pink', 'candy', 'green', 'flower']
The n-grams are then generated from this flat token sequence, which is why "candy green" appears.
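To make the two steps above concrete, here is a minimal pure-Python sketch that approximates the default 'word' analyzer. The regular expression is scikit-learn's documented default token_pattern; the function name word_ngrams is hypothetical and the loop is a simplification, not the library's actual implementation.

```python
import re

# Default token_pattern of TfidfVectorizer: runs of 2 or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 2)):
    """Approximate the default 'word' analyzer (hypothetical helper).

    Lowercases the text, extracts tokens, then builds every n-gram of
    consecutive tokens. Punctuation never reaches the n-gram step.
    """
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


features = sorted(set(word_ngrams("red sun, pink candy. Green flower.")))
print(features)
```

Because the period is dropped before the n-grams are built, "candy" and "green" end up adjacent in the token list, so the bigram "candy green" is produced like any other.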
Since TfidfVectorizer is a bag-of-words technique that works on the words appearing in a document, it does not keep any information about the sentence structure of a single document. The only word order it sees is the adjacency of consecutive tokens inside each n-gram.
If you want the sentences to be treated separately, you should detect the sentences yourself and pass them as different documents.
Alternatively, pass your own analyzer and n-gram generator to TfidfVectorizer.
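As a rough sketch of the custom-analyzer route, the function below splits a document into sentences first and builds n-grams inside each sentence only. The name sentence_ngrams is hypothetical, and the sentence splitter is a deliberately naive assumption (it cuts at '.', '!' and '?', so abbreviations and decimal numbers would be split incorrectly); a real sentence tokenizer could be substituted.

```python
import re

# Same pattern as the vectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Naive assumption: a sentence ends at '.', '!' or '?'.
SENTENCE_END = re.compile(r"[.!?]+")


def sentence_ngrams(doc, ngram_range=(1, 2)):
    """Hypothetical analyzer: n-grams never cross a sentence boundary."""
    min_n, max_n = ngram_range
    ngrams = []
    for sentence in SENTENCE_END.split(doc):
        tokens = TOKEN_PATTERN.findall(sentence.lower())
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


result = sentence_ngrams("red sun, pink candy. Green flower.")
print(result)
```

A callable like this can be passed as TfidfVectorizer(analyzer=sentence_ngrams, ...). When analyzer is a callable it receives the raw document, so the vectorizer's own ngram_range, token_pattern and lowercase settings are not applied and must be handled inside the function, as done here. Unlike passing each sentence as a separate document, this keeps one row per original document.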
For more information on how TfidfVectorizer actually works, see my other answer: