How to find ngram frequency of a column in a pandas dataframe?

Below is the input pandas dataframe I have.



I want to find the frequency of unigrams & bigrams.

How can I do this using NLTK or scikit-learn?

I wrote the code below, which takes a single string as input. How can I extend it to a Series/DataFrame?

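import nltk  # needed for nltk.word_tokenize below
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)  # requires the NLTK tokenizer data to be downloaded
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)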


  • If your data looks like this:

    import pandas as pd
    df = pd.DataFrame([
        'must watch. Good acting',
        'average movie. Bad acting',
        'good movie. Good acting',
        'pathetic. Avoid',
        'avoid'], columns=['description'])

    You can use CountVectorizer from the sklearn package:

    from sklearn.feature_extraction.text import CountVectorizer
    word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(df['description'])
    frequencies = sum(sparse_matrix).toarray()[0]
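    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
    pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])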

    Which gives you (features are listed in alphabetical order):

                    frequency
    acting                  3
    average                 1
    average movie           1
    avoid                   2
    bad                     1
    bad acting              1
    good                    3
    good acting             2
    good movie              1
    movie                   2
    movie bad               1
    movie good              1
    must                    1
    must watch              1
    pathetic                1
    pathetic avoid          1
    watch                   1
    watch good              1


    fit only "trains" your vectorizer: it splits your corpus into words and builds a vocabulary from them. transform can then take a new document and produce a vector of frequencies based on that vocabulary.

    Here your training set is also the set you want counts for, so you can do both in one step (fit_transform). Because you have 5 documents, the result is a matrix with 5 rows, one vector per document. You want a single global vector, so you have to sum the rows.
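To make the fit / transform / sum steps above concrete, here is a minimal standard-library sketch of the same idea. It is not scikit-learn's implementation: the helper names (ngrams, fit, transform) are hypothetical, and the tokenizer only approximates CountVectorizer's defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's default tokenization.
TOKEN = re.compile(r"\b\w\w+\b")

def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max in one document."""
    tokens = TOKEN.findall(text.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit(docs):
    """'Train': build a vocabulary mapping each n-gram to a column index."""
    terms = sorted({gram for doc in docs for gram in ngrams(doc)})
    return {term: col for col, term in enumerate(terms)}

def transform(docs, vocab):
    """Turn each document into a row of counts, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(ngrams(doc))
        rows.append([counts[term] for term in vocab])  # unknown n-grams are ignored
    return rows

docs = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

vocab = fit(docs)              # vocabulary learned from the corpus
rows = transform(docs, vocab)  # 5 documents -> 5 count vectors

# Sum the rows column by column to get one global frequency per n-gram.
totals = [sum(column) for column in zip(*rows)]
global_freq = dict(zip(vocab, totals))
```

Calling transform(['good good acting'], vocab) on a new document counts only n-grams already in the vocabulary, which is why fit and transform are separate steps.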

    EDIT 2

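    For big dataframes, you can speed up the frequency computation by using either of the following:

    frequencies = sum(sparse_matrix).data

    or

    frequencies = sparse_matrix.sum(axis=0).T

    The first keeps the summed row sparse and reads its stored values directly. The second sums the columns inside SciPy instead of adding rows one at a time in Python, and gives a column of totals with one entry per feature.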