I have a pandas data frame that contains account information and a free-text reason for canceling. I have cleaned and lemmatized the text and removed my own stop words to come up with n-grams and their frequencies. How do I add all of the n-grams back to the original data set so that the frequencies sit alongside the account-level information? Ideally I want to take this and output a file that I can give to the business.
Is there a way I can use the sparse matrix to accomplish this? I'm not sure whether this is possible, or whether it would scale to larger data sets.
Below is a picture of the frequencies I want to attach to the original data frame somehow.
I ended up figuring out how to do this.
After creating the sparse matrix and converting it to a data frame, I was able to merge it with the original data frame by using the indexes as the join column. Below is a sample from my code:
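import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# dfn (the original data frame) and stop (my stop-word list) are defined earlier
tf_vect_final = CountVectorizer(max_df=0.90, min_df=5, stop_words=stop,
                                ngram_range=(5, 5), analyzer='word')
comments = dfn['Not Written Comments_clean_stop'].tolist()
tf_vect_final.fit(comments)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tf_vect_final.get_feature_names_out()
print("There are {} grams found".format(len(feature_names)))
# Transform to a sparse matrix, then to a dense data frame that reuses dfn's index
tff = tf_vect_final.transform(comments)
tff = pd.DataFrame(tff.toarray(), columns=feature_names, index=dfn.index)
# Name both indexes 'PK', turn them into columns, and merge on that column
dfn.index.names = ['PK']
tff.index.names = ['PK']
dfn = dfn.reset_index()
tff = tff.reset_index()
dfn_final = dfn.merge(tff, on='PK')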