Tags: python, scikit-learn, nlp, tf-idf, countvectorizer

Vectorizing combinations of words in Python


I have a dataset of medical text. I apply a tf-idf vectorizer to it and compute the tf-idf score for each word like this:

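import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer as tf

# Keep only terms that occur in at least 60 documents and drop English stop words
vect = tf(min_df=60, stop_words='english')

# fit_transform expects an iterable of strings; if df is a DataFrame,
# pass its text column rather than the whole frame
dtm = vect.fit_transform(df)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = vect.get_feature_names_out()

# One row per document, one column per term, tf-idf scores as values
x = pd.DataFrame(dtm.toarray(), columns=features)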

My question is this: TfidfVectorizer splits the text into individual words, for example "pain", "headache", "nausea", and so on. How can I get word combinations in the TfidfVectorizer output, for example "severe pain", "cluster headache", "nausea vomiting"? Thanks.


Solution

  • Use the ngram_range parameter. To get both single words and two-word combinations (unigrams and bigrams):

    vect = tf(min_df=60, stop_words='english', ngram_range=(1,2))
    

    or, if you want only two-word combinations (bigrams):

    vect = tf(min_df=60, stop_words='english', ngram_range=(2,2))