Tags: python, pandas, group-by, pandas-groupby, countvectorizer

How to group-by and get most frequent ngram?


My dataframe looks like this:

ID topics   text
1     1        twitter is my favorite social media
2     1        favorite social media
3     2        rt twitter tomorrow
4     3        rt facebook today
5     3        rt twitter
6     4        vote for the best twitter
7     2        twitter tomorrow
8     4        best twitter

I want to group by topics and use CountVectorizer to compute the most frequent bigram per topic. (I strongly prefer CountVectorizer because it lets me remove stop words in multiple languages and set an n-gram range, e.g. 3- and 4-grams.) Once I have the most frequent bigram, I want to create a new column called "bigram" and assign each topic's most frequent bigram to that column.

I want my output to look like this.

ID topics   text                                  bigram
1     1        twitter is my favorite social media   favorite social
2     1        favorite social media                 favorite social
3     2        rt twitter tomorrow                   twitter tomorrow
4     2        twitter tomorrow                      twitter tomorrow
5     3        rt twitter                            rt twitter
6     3        rt facebook today                     rt twitter
7     4        vote for the best twitter             best twitter
8     4        best twitter                          best twitter

Please note that the rows do NOT need to be sorted by 'topics'. I sorted them here only to make the example easier to read.

This code will be run on 6M rows of data, so it needs to be fast.

What is the best way to do it using pandas? I apologize if it seems too complicated.


Solution

  • Update

    You can use sklearn:

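    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    # Count bigrams per row; English stop words are dropped before bigrams are formed
    vect = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words='english')
    data = vect.fit_transform(df['text'])
    # Sum the counts per topic and keep the bigram with the highest total.
    # Note: toarray() builds a dense matrix, which can be large on big inputs.
    bigram = (pd.DataFrame(data=data.toarray(),
                           index=df['topics'],
                           columns=vect.get_feature_names_out())
                .groupby('topics').sum().idxmax(axis=1))
    # Broadcast each topic's top bigram back onto the rows
    df['bigram'] = df['topics'].map(bigram)
    print(df)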
    
    # Output
       ID  topics                                 text            bigram
    0   1       1  twitter is my favorite social media   favorite social
    1   2       1                favorite social media   favorite social
    2   3       2                  rt twitter tomorrow  twitter tomorrow
    3   4       3                    rt facebook today    facebook today
    4   5       3                           rt twitter    facebook today
    5   6       4            vote for the best twitter      best twitter
    6   7       2                     twitter tomorrow  twitter tomorrow
    7   8       4                         best twitter      best twitter
    

    Update 2

    How about if I want the 3 most frequent n-grams? What can I use instead of idxmax()?

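    # For each topic, sum the counts and keep the names of the 3 largest
    # as a Series indexed 0, 1, 2 (one future column per rank).
    # Note: nlargest(3) can return bigrams with a count of 0 when a topic
    # has fewer than 3 distinct bigrams.
    most_common3 = lambda x: x.sum().nlargest(3).index.to_frame(index=False).squeeze()
    bigram = (pd.DataFrame(data=data.toarray(),
                           index=df['topics'],
                           columns=vect.get_feature_names_out())
                .groupby('topics').apply(most_common3)
                .rename(columns=lambda x: f"bigram{x+1}").reset_index())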
    df = df.merge(bigram, on='topics')
    print(df)
    
    # Output
       topics                                 text           bigram1       bigram2           bigram3
    0       1  twitter is my favorite social media   favorite social  social media  twitter favorite
    1       1                favorite social media   favorite social  social media  twitter favorite
    2       2                  rt twitter tomorrow  twitter tomorrow    rt twitter      best twitter
    3       2                     twitter tomorrow  twitter tomorrow    rt twitter      best twitter
    4       3                    rt facebook today    facebook today   rt facebook        rt twitter
    5       3                           rt twitter    facebook today   rt facebook        rt twitter
    6       4            vote for the best twitter      best twitter     vote best    facebook today
    7       4                         best twitter      best twitter     vote best    facebook today
    

    Old answer

    You can use nltk:

    import nltk
    
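    # Split each text on whitespace and turn it into a list of bigram tuples
    to_bigram = lambda x: list(nltk.bigrams(x.split()))
    # Caveat: mode() runs over whole bigram lists (one per row), and [0][0]
    # takes the first bigram of the first modal list, so individual bigrams
    # are not counted across rows
    most_common = (df.set_index('topics')['text'].map(to_bigram)
                     .groupby(level=0).apply(lambda x: x.mode()[0][0]))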
    
    df['bigram'] = df['topics'].map(most_common)
    print(df)
    
    # Output
       ID  topics                                 text              bigram
    0   1       1  twitter is my favorite social media  (favorite, social)
    1   2       1                favorite social media  (favorite, social)
    2   3       2                  rt twitter tomorrow       (rt, twitter)
    3   4       3                    rt facebook today      (rt, facebook)
    4   5       3                           rt twitter      (rt, facebook)
    5   6       4            vote for the best twitter     (best, twitter)
    6   7       2                     twitter tomorrow       (rt, twitter)
    7   8       4                         best twitter     (best, twitter)
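
Note that this old snippet reports (rt, twitter) for topic 2 even though "twitter tomorrow" occurs twice there, because it takes the mode over whole rows rather than over individual bigrams. A minimal sketch that counts every bigram per topic could look like the following. It assumes plain whitespace tokenization with no stop-word removal, uses `zip(tokens, tokens[1:])` as a stand-in for `nltk.bigrams` so it runs without nltk, and the helper names are hypothetical.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "topics": [1, 1, 2, 3, 3, 4, 2, 4],
    "text": [
        "twitter is my favorite social media",
        "favorite social media",
        "rt twitter tomorrow",
        "rt facebook today",
        "rt twitter",
        "vote for the best twitter",
        "twitter tomorrow",
        "best twitter",
    ],
})

def to_bigrams(text):
    """Whitespace-tokenize and pair each token with its successor."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

def most_common_bigram(texts):
    """Count every bigram across all texts of one topic; return the top one as a string."""
    counts = Counter(bg for text in texts for bg in to_bigrams(text))
    if not counts:
        return None
    best, _ = counts.most_common(1)[0]
    return " ".join(best)

most_common = df.groupby("topics")["text"].agg(most_common_bigram)
df["bigram"] = df["topics"].map(most_common)
print(df)
```

This loops in Python over every row, so on millions of rows the CountVectorizer approach is likely to be the faster option.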