Tags: python, dataframe, nlp, apply, counter

Count ngram frequency in documents


I would like to count the n-grams in each document of my corpus in order to delete those that are the most frequent across all documents (say, those that appear in more than 10 different documents).

import pandas as pd
import numpy as np

# Toy corpus: one row per document, with its list of bigrams.
data = {'docid': [1, 2, 3], 'bigrams': [['i_am', 'am_not', 'not_very', 'very_smart'], ['i_am', 'am_learning', 'learning_python'], ['i_have', 'have_blue', 'blue_eyes']]}
dataset = pd.DataFrame(data, columns=['docid', 'bigrams'])


# Collect the distinct bigrams across all documents
# (list membership makes each lookup O(n); a set would be faster).
bigrams_list = []
for bigrams in dataset['bigrams']:
    for bigram in bigrams:
        if bigram not in bigrams_list:
            bigrams_list.append(bigram)

Here, I guess I would iterate over the DataFrame rows and, for each bigram in bigrams_list, generate a boolean indicating whether it is present in the document (row). But that seems inefficient, given that my corpus has more than 5,000 documents and 400,000 distinct bigrams.
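For illustration, something like the following rough sketch is what I have in mind, which is exactly why I worry about scale: it materializes a boolean table of more than 5,000 × 400,000 cells.

doc_rows = []
for doc_bigrams in dataset['bigrams']:
    doc_set = set(doc_bigrams)  # fast membership tests per document
    doc_rows.append([bigram in doc_set for bigram in bigrams_list])

# One row per document, one boolean column per distinct bigram.
presence = pd.DataFrame(doc_rows, columns=bigrams_list, index=dataset['docid'])

# Column sums give the number of documents each bigram appears in;
# anything above the threshold (10 documents) would then be dropped.
doc_freq = presence.sum(axis=0)
too_common = doc_freq[doc_freq > 10].index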

Does anyone know what would be best for this situation?


Solution

  • You could take a look at scikit-learn's CountVectorizer. It is mostly meant for feature preprocessing in NLP, but I'm pretty sure it can do what you need efficiently: set ngram_range to the desired value(s), fit the vectorizer, then combine the results of .get_feature_names_out() (.get_feature_names() on scikit-learn < 1.0) with the produced matrix to associate every n-gram with its count over the entire corpus; a sketch follows below. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
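A minimal sketch of that idea, with one adjustment: since the bigrams in the question are already tokenized, the built-in analyzer is swapped for a pass-through callable instead of setting ngram_range on raw text, and binary=True makes the column sums equal to document frequencies rather than total occurrence counts.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Pass the pre-tokenized bigram lists straight through, and record
# presence/absence (0/1) per document rather than occurrence counts.
vectorizer = CountVectorizer(analyzer=lambda doc: doc, binary=True)
X = vectorizer.fit_transform(dataset['bigrams'])

# doc_freq[i] = number of documents containing the bigram names[i].
names = vectorizer.get_feature_names_out()  # .get_feature_names() on < 1.0
doc_freq = np.asarray(X.sum(axis=0)).ravel()

# Drop bigrams that appear in more than 10 different documents.
too_common = {name for name, df in zip(names, doc_freq) if df > 10}
dataset['bigrams'] = dataset['bigrams'].apply(
    lambda bigrams: [bg for bg in bigrams if bg not in too_common])

Because X is a sparse matrix, the column sums stay cheap even at 5,000 documents by 400,000 bigrams, which is exactly the scale the question worries about.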