Bi-grams by date

I have the following dataset:

       Date                  D
0   01/18/2020  shares recipes ... - news updates · breaking news emails · lives to remem...
1   01/18/2020  both sides of the pineapple slices with olive oil. ... some of my other support go-to's i...
2   01/18/2020  honey, tea tree oil ...learn more from webmd about honey ...
3   01/18/2020  years of downtown arts | times leaderas the local community dealt with concerns, pet...
4   01/18/2020  brooklyn, ny | opentableblood orange, arugula, hazelnuts, on toast. charcuterie. $16.00. smoked ...
5   01/19/2020  santa maria di leuca - we the italiansthe sounds of the taranta, the smell of tomatoes, olive oil...
6   01/19/2020  abuse in amish communities : nprit's been a minute with sam sanders · code switch · throughline ...
7   01/19/2020  fast, healthy recipe ideas – cbs new ...toss the pork cubes with chili powder, oregano, cumin, c...
9   01/19/2020  100; 51-100 | csnyi have used oregano oil, coconut oil, famciclovir, an..

I am interested in a dataframe that shows bi-gram frequencies by Date. Currently I am doing the following:

import nltk
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

stop_words = stopwords.words('english')

word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word', stop_words=stop_words)
sparse_matrix = word_vectorizer.fit_transform(df['D'])
frequencies = sum(sparse_matrix).toarray()[0]
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)

But this shows only the overall frequency of each bi-gram, not the frequency by Date. I would expect something like this (expected output):

Date           Bi-gram      Frequency
01/18/2020     bi-gram_1     43
               bi-gram_2     12
01/19/2020     bi-gram_5     42
               bi-gram_6     23

and so on. bi-gram_1, bi-gram_2, ... are just placeholders.

Any advice on how I can get such a dataframe?


    1. The way I approached this problem was to reorganize your original dataframe into a dictionary keyed by date, where each date maps to a list of sentences:

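       new_df = {}
       for index, row in df.iterrows():
           # create an empty list the first time a date is seen
           if row['Date'] not in new_df:
               new_df[row['Date']] = []
           # add this row's text to the list for its date
           new_df[row['Date']].append(row['D'])

    row['Date'] is the date and row['D'] is the text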

    The output will look something like this:

        {'1/18/20': ['shares recipes news updates breaking news google',
                     'shares recipes news updates breaking news seo'],
         '1/19/20': ['shares recipes news updates breaking news emails',
                     'shares recipes news updates breaking news ...

    2. Now you can iterate over each date and get the frequency of each bi-gram within that date. Each date's result is stored in a dataframe like the one you had above and appended to a list. At the end, the list will contain n dataframes, where n is the number of dates in your dataset:

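       # stop_words is the list defined in the question
       word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word',
                                         stop_words=stop_words)
       frames = []
       for date, values in new_df.items():
           sparse_matrix = word_vectorizer.fit_transform(values)
           frequencies = sum(sparse_matrix).toarray()[0]
           # get_feature_names_out() needs scikit-learn >= 1.0 (older: get_feature_names())
           results = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)
           frames.append(results)  # collect one dataframe per date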

    Or, if you want the date to show up on every row of the dataframe, you can modify step 2 to be:

       # stop_words is the list defined in the question
       word_vectorizer = CountVectorizer(ngram_range=(2,2), analyzer='word',
                                         stop_words=stop_words)
       frames = []
       for date, values in new_df.items():
           sparse_matrix = word_vectorizer.fit_transform(values)
           frequencies = sum(sparse_matrix).toarray()[0]
           # get_feature_names_out() needs scikit-learn >= 1.0 (older: get_feature_names())
           results = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['Frequency']).sort_values(by='Frequency',ascending=False)
           results["Date"] = date  # the scalar is broadcast to every row
           frames.append(results)

    3. Finally, you can concatenate the dataframes together:

       pd.concat(frames, keys=[k for k in new_df.keys()])
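To make the shape of the concatenated result concrete, here is a minimal sketch with made-up counts. The two small frames and the level names "Date" and "Bi-gram" are assumptions chosen for illustration; they are not output from the real dataset. Passing a dict to `pd.concat` uses its keys as the outer index level, which should give a layout close to the expected output in the question.

```python
import pandas as pd

# Hypothetical per-date frames, shaped like the `results` frames built in the loop.
per_date = {
    "01/18/2020": pd.DataFrame({"Frequency": [2, 1]}, index=["olive oil", "tea tree"]),
    "01/19/2020": pd.DataFrame({"Frequency": [1]}, index=["coconut oil"]),
}

# The dict keys become the outer index level; `names` labels both levels.
combined = pd.concat(per_date, names=["Date", "Bi-gram"])
print(combined)

# Flatten the MultiIndex if ordinary columns are preferred.
flat = combined.reset_index()
print(flat)
```

With `reset_index()` the date is repeated on every row, which is an alternative to adding a "Date" column inside the loop.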

    One improvement would be to find a way to regroup the dataframe within pandas itself instead of building a new dictionary.