I have a dataset of medical text. I apply a TF-IDF vectorizer to it and calculate the TF-IDF score for each word, like this:
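import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer as tf
# Keep terms that occur in at least 60 documents; drop English stop words.
vect = tf(min_df=60, stop_words='english')
# df must be an iterable of strings (e.g. one text column), not a whole DataFrame.
dtm = vect.fit_transform(df)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
l = vect.get_feature_names_out()
x = pd.DataFrame(dtm.toarray(), columns=l)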
My question is this: when I apply TfidfVectorizer, it splits the text into individual words, for example "pain", "headache", "nausea", and so on. How can I get word combinations in the output of TfidfVectorizer, for example "severe pain", "cluster headache", "nausea vomiting"? Thanks
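Use the ngram_range parameter, which sets the minimum and maximum number of words per extracted term. To keep single words and also add two-word combinations: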
vect = tf(min_df=60, stop_words='english', ngram_range=(1,2))
or, if you want only two-word combinations:
vect = tf(min_df=60, stop_words='english', ngram_range=(2,2))