python · pandas · nltk · countvectorizer

Bi-grams by date


I have the following dataset:

       Date                  D
0   01/18/2020  shares recipes ... - news updates · breaking news emails · lives to remem...
1   01/18/2020  both sides of the pineapple slices with olive oil. ... some of my other support go-to's i...
2   01/18/2020  honey, tea tree oil ...learn more from webmd about honey ...
3   01/18/2020  years of downtown arts | times leaderas the local community dealt with concerns, pet...
4   01/18/2020  brooklyn, ny | opentableblood orange, arugula, hazelnuts, on toast. charcuterie. $16.00. smoked ...
5   01/19/2020  santa maria di leuca - we the italiansthe sounds of the taranta, the smell of tomatoes, olive oil...
6   01/19/2020  abuse in amish communities : nprit's been a minute with sam sanders · code switch · throughline ...
7   01/19/2020  fast, healthy recipe ideas – cbs new ...toss the pork cubes with chili powder, oregano, cumin, c...
9   01/19/2020  100; 51-100 | csnyi have used oregano oil, coconut oil, famciclovir, an..

I am interested in a dataframe which shows bi-grams' frequencies by Date. Currently I am doing as follows:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word', stop_words=stop_words)
sparse_matrix = word_vectorizer.fit_transform(df['D'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency', ascending=False)

But this only shows the overall bi-gram frequencies, not the bi-grams by Date. I would expect something like this (expected output):

Date           Bi-gram      Frequency
01/18/2020     bi-gram_1     43
               bi-gram_2     12
               ...
01/19/2020     bi-gram_5     42
               bi-gram_6     23

and so on. bi-gram_1, bi-gram_2, ... are just placeholders.

Any advice on how I can get such a dataframe?


Solution

    1. The way I went about this problem was to reorganize your original dataframe so that the overarching key is the date and each date maps to a list of sentences:

       new_df = {}
       for index, row in df.iterrows():
           if row[0] not in new_df:
               new_df[row[0]] = []
           new_df[row[0]].append(row[1])
      

    Here, row[0] is the date and row[1] is the document text.

    The output will look something like this:

        {'1/18/20': ['shares recipes news updates breaking news google', 'shares 
        recipes news updates breaking news seo'], '1/19/20': ['shares recipes news 
        updates breaking news emails', 'shares recipes news updates breaking news 
        web']}
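
    As an aside, the same date-to-documents mapping can be built without the explicit loop by using pandas groupby. A minimal sketch with toy data (the column names 'Date' and 'D' match the question; the sample rows are illustrative):

    ```python
    import pandas as pd

    # Toy stand-in for the original dataframe (columns: 'Date', 'D')
    df = pd.DataFrame({
        'Date': ['01/18/2020', '01/18/2020', '01/19/2020'],
        'D': ['shares recipes news', 'olive oil recipes', 'oregano oil drops'],
    })

    # groupby collects each date's documents into a list, producing the
    # same structure as the dictionary built in step 1
    new_df = df.groupby('Date')['D'].apply(list).to_dict()
    # {'01/18/2020': ['shares recipes news', 'olive oil recipes'],
    #  '01/19/2020': ['oregano oil drops']}
    ```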
    
    2. Now you can iterate over each date and get the frequency of each bigram within that date. For each date the counts are stored in a dataframe like the one you had above and appended to a list. At the end, the list will contain n dataframes, where n is the number of dates in your dataset:

       word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word', 
       stop_words=stop_words)
       frames = []
       for date,values in new_df.items():
           sparse_matrix = word_vectorizer.fit_transform(values)
           frequencies = sum(sparse_matrix).toarray()[0] 
           results = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)
      
           frames.append(results)
      

    OR, if you want the date to show up on all rows of the dataframe, you can modify step 2 to be:

    word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word', 
    stop_words=stop_words)
    frames = []
    for date,values in new_df.items():
        sparse_matrix = word_vectorizer.fit_transform(values)
        frequencies = sum(sparse_matrix).toarray()[0] 
        results = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)
        results["Date"] = date  # broadcast the date to every row
    
        frames.append(results)
    
    3. Finally, you can concatenate the dataframes together:

       pd.concat(frames, keys=list(new_df))
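
    The keys argument gives the concatenated result a two-level index of (date, bigram), which matches the expected output in the question. A small self-contained sketch of what concat does here (the jan18/jan19 frames and their values are toy stand-ins for the per-date tables built in step 2):

    ```python
    import pandas as pd

    # Toy per-date frequency tables, shaped like the ones built in step 2
    jan18 = pd.DataFrame({'Frequency': [3, 1]}, index=['olive oil', 'breaking news'])
    jan19 = pd.DataFrame({'Frequency': [2]}, index=['oregano oil'])

    # keys= adds an outer index level, so each row is addressed by (date, bigram)
    combined = pd.concat([jan18, jan19], keys=['01/18/2020', '01/19/2020'])

    print(combined.loc[('01/18/2020', 'olive oil'), 'Frequency'])  # 3
    ```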
      

    One improvement you could make would be to find a way to do this regrouping within pandas itself instead of building a new dictionary.
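
    Following up on that suggested improvement: one way to stay inside pandas is to group by date and count bigrams per group. This sketch uses a small Counter-based bigram counter in place of CountVectorizer (so no stop-word removal) just to keep it self-contained; bigram_counts and the toy data are illustrative names, not from the original code:

    ```python
    from collections import Counter
    import pandas as pd

    def bigram_counts(texts):
        """Count word bigrams across a list of documents -- a lightweight
        stand-in for CountVectorizer(ngram_range=(2, 2))."""
        counts = Counter()
        for text in texts:
            words = text.lower().split()
            counts.update(' '.join(pair) for pair in zip(words, words[1:]))
        return counts

    # Toy stand-in for the original dataframe
    df = pd.DataFrame({
        'Date': ['01/18/2020', '01/18/2020', '01/19/2020'],
        'D': ['olive oil recipes', 'olive oil news', 'oregano oil drops'],
    })

    # apply returns a Series per group, so the result gets a (Date, bigram) index
    result = (df.groupby('Date')['D']
                .apply(lambda texts: pd.Series(bigram_counts(texts)))
                .rename('Frequency'))

    print(result.loc[('01/18/2020', 'olive oil')])  # 2
    ```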