Tags: pandas, nlp, scikit-learn, nltk, text-mining

How to find ngram frequency of a column in a pandas dataframe?


Below is the input pandas dataframe I have.

[image: input dataframe]

I want to find the frequency of unigrams & bigrams. A sample of what I am expecting is shown below:

[image: expected unigram/bigram frequency table]

How can I do this using nltk or scikit-learn?

I wrote the code below, which takes a string as input. How do I extend it to a Series/DataFrame?

import nltk
from nltk.collocations import BigramCollocationFinder

desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # bigram frequencies; viewitems() only exists in Python 2

Solution

  • If your data looks like this:

    import pandas as pd
    df = pd.DataFrame([
        'must watch. Good acting',
        'average movie. Bad acting',
        'good movie. Good acting',
        'pathetic. Avoid',
        'avoid'], columns=['description'])
    

    You could use CountVectorizer from the sklearn (scikit-learn) package:

    from sklearn.feature_extraction.text import CountVectorizer

    word_vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(df['description'])
    frequencies = sum(sparse_matrix).toarray()[0]
    pd.DataFrame(frequencies,
                 index=word_vectorizer.get_feature_names_out(),  # get_feature_names() before scikit-learn 1.0
                 columns=['frequency'])
    

    Which gives you:

                        frequency
        acting          3
        average         1
        average movie   1
        avoid           2
        bad             1
        bad acting      1
        good            3
        good acting     2
        good movie      1
        movie           2
        movie bad       1
        movie good      1
        must            1
        must watch      1
        pathetic        1
        pathetic avoid  1
        watch           1
        watch good      1
    
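    If you want the most frequent ngrams first, you can sort the resulting frame; a short sketch, assuming the DataFrame above is kept in a variable (freq here is just an illustrative name):

    freq = pd.DataFrame(frequencies,
                        index=word_vectorizer.get_feature_names_out(),
                        columns=['frequency'])
    freq.sort_values(by='frequency', ascending=False)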

    EDIT

    fit will just "train" your vectorizer: it splits the words of your corpus and builds a vocabulary from them. transform can then take a new document and create a vector of frequencies based on that vocabulary.

    Here your training set is also the set you want frequencies for, so you can do both at once with fit_transform. Because there are 5 documents, it creates a matrix of 5 row vectors. You want a single global vector, so you have to sum them.
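    For illustration, a minimal sketch of the two steps done separately (the unseen document here is invented for the example):

    # fit builds the vocabulary from the corpus
    word_vectorizer.fit(df['description'])
    # transform counts the ngrams of a new document against that vocabulary
    new_vector = word_vectorizer.transform(['good movie, must watch'])
    # new_vector is a 1 x len(vocabulary) sparse row of counts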

    EDIT 2

    For big dataframes, you can speed up the frequency computation by using:

    frequencies = sum(sparse_matrix).data
    

    or

    frequencies = sparse_matrix.sum(axis=0).T
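    Either variant plugs back into the same DataFrame construction. A sketch, assuming you want a flat 1-D array (.sum(axis=0) returns a dense numpy matrix, so it is flattened first):

    import numpy as np

    frequencies = np.asarray(sparse_matrix.sum(axis=0)).ravel()
    pd.DataFrame(frequencies,
                 index=word_vectorizer.get_feature_names_out(),
                 columns=['frequency'])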
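    To answer the nltk part of the question as well: BigramCollocationFinder.from_documents accepts an iterable of token lists, so the snippet from the question extends to a whole column. A minimal sketch (in recent nltk versions, from_documents also keeps bigrams from crossing row boundaries):

    import nltk
    from nltk.collocations import BigramCollocationFinder

    # one token list per row of the column
    documents = df['description'].apply(nltk.word_tokenize)
    finder = BigramCollocationFinder.from_documents(documents)
    finder.ngram_fd.most_common()  # bigram frequencies across the whole column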