python sklearn-pandas

How to tie n-gram frequencies from a column back to the original data frame?


I have a pandas data frame that has account information and a reason for canceling. I have cleaned and lemmatized the text and removed my own stop words to come up with n-grams and their frequencies. How do I add all of the n-grams back to the original data set, so that the frequencies sit alongside the account-level information? Ideally I want to take this and output a file that I can give to the business.

Is there a way I can use the sparse matrix to accomplish this? I'm not sure whether this is possible, or whether it would scale to larger data sets.

Below is a picture of the frequencies I want to attach to the original data frame somehow.

[Image: frequencies code]


Solution

  • I ended up figuring out how to do this:

    After creating the sparse matrix and converting it to a data frame, I was able to merge it with the original data frame by using the indexes as the join key. Below is a sample from my code:

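    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # `dfn` is the original data frame and `stop` is the custom stop-word list
    tf_vect_final = CountVectorizer(max_df=0.90, min_df=5, stop_words=stop,
                                    ngram_range=(5, 5), analyzer='word')

    comments = dfn['Not Written Comments_clean_stop'].tolist()
    tf_vect_final.fit(comments)

    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    grams = tf_vect_final.get_feature_names_out()
    print("There are {} grams found".format(len(grams)))

    # One row per comment, one column per n-gram; reuse dfn's index so the rows line up
    tff = tf_vect_final.transform(comments)
    tff = pd.DataFrame(tff.toarray(), columns=grams, index=dfn.index)

    # Name both indexes 'PK', turn them into columns, and merge on that key
    dfn.index.names = ['PK']
    tff.index.names = ['PK']
    dfn = dfn.reset_index()
    tff = tff.reset_index()
    dfn_final = dfn.merge(tff, on='PK')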