Tags: python, text, text-manipulation

N-gram analysis based on impression in Python


This is what my sample dataset looks like:

[Image: sample dataset with a query column and an impressions column]

My goal is to understand how many impressions are associated with one-word, two-word, three-word, four-word, five-word, and six-word phrases. I ran an N-gram analysis, but it only returns counts, not impressions. This is my current n-gram code:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

def find_ngrams(text, n):
    word_vectorizer = CountVectorizer(ngram_range=(n, n), analyzer='word')
    sparse_matrix = word_vectorizer.fit_transform(text)
    frequencies = sum(sparse_matrix).toarray()[0]
    # use get_feature_names_out() on scikit-learn >= 1.0
    ngram = pd.DataFrame(frequencies,
                         index=word_vectorizer.get_feature_names(),
                         columns=['frequency'])
    ngram = ngram.sort_values(by=['frequency'], ascending=[False])
    return ngram

one = find_ngrams(df['query'],1)
bi = find_ngrams(df['query'],2)
tri = find_ngrams(df['query'],3)
quad = find_ngrams(df['query'],4)
pent = find_ngrams(df['query'],5)
hexx = find_ngrams(df['query'],6)

I figure what I need to do is:

1. Split the queries into one-word to six-word n-grams.
2. Attach the impressions to the split n-grams.
3. Regroup all the split n-grams and sum the impressions.

Take the second query "dog common diseases and how to treat them" as an example. It should be split as:

(1) 1-gram: dog, common, diseases, and, how, to, treat, them;
(2) 2-gram: dog common, common diseases, diseases and, and how, how to, to treat, treat them;
(3) 3-gram: dog common diseases, common diseases and, diseases and how, and how to, how to treat, to treat them;
(4) 4-gram: dog common diseases and, common diseases and how, diseases and how to, and how to treat, how to treat them;
(5) 5-gram: dog common diseases and how, common diseases and how to, diseases and how to treat, and how to treat them;
(6) 6-gram: dog common diseases and how to, common diseases and how to treat, diseases and how to treat them;
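The splits listed above can be generated with plain string operations; a minimal sketch, assuming simple whitespace tokenization:

```python
query = "dog common diseases and how to treat them"
words = query.split()

# every contiguous n-gram for n = 1..6, keyed by n
ngrams = {n: [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
          for n in range(1, 7)}

print(ngrams[2][:3])  # ['dog common', 'common diseases', 'diseases and']
```

With 8 words in the query this yields 7 bigrams and 3 six-grams, matching the lists above.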

Solution

  • Here is a method! Not the most efficient, but let's not optimize prematurely. The idea is to use apply to get a new pd.DataFrame with new columns for all n-grams, join this with the old dataframe, and do some stacking and grouping.

    import pandas as pd
    
    df = pd.DataFrame({
        "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
        "count": [1000, 200, 100, 150]
    })
    
    def n_grams(txt):
        grams = list()
        words = txt.split(' ')
        for i in range(len(words)):
            for k in range(1, len(words) - i + 1):
                grams.append(" ".join(words[i:i+k]))
        return pd.Series(grams)
    
    counts = df.squery.apply(n_grams).join(df)
    
    counts.drop("squery", axis=1).set_index("count").unstack()\
        .rename("ngram").dropna().reset_index()\
        .drop("level_0", axis=1).groupby("ngram")["count"].sum()
    

    This last expression will return a pd.Series like below.

        ngram
    a                       1000
    a dog                   1000
    cat                      200
    cat or                   100
    cat or not               100
    cat or not to            100
    cat or not to cat        100
    dog                     1350
    dog habits               200
    dog owners               150
    feed                    1000
    feed a                  1000
    feed a dog              1000
    habits                   200
    how                     1000
    how to                  1000
    how to feed             1000
    how to feed a           1000
    how to feed a dog       1000
    not                      100
    not to                   100
    not to cat               100
    or                       100
    or not                   100
    or not to                100
    or not to cat            100
    owners                   150
    to                      1200
    to cat                   200
    to cat or                100
    to cat or not            100
    to cat or not to         100
    to cat or not to cat     100
    to feed                 1000
    to feed a               1000
    to feed a dog           1000
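
The stack/unstack chain above can be hard to follow; the same grouping can also be written with melt, which may read more clearly. A sketch using the same toy df and n_grams helper as above (repeated here so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    "squery": ["how to feed a dog", "dog habits", "to cat or not to cat", "dog owners"],
    "count": [1000, 200, 100, 150],
})

def n_grams(txt):
    words = txt.split(" ")
    return pd.Series(" ".join(words[i:i + k])
                     for i in range(len(words))
                     for k in range(1, len(words) - i + 1))

# wide frame of n-grams -> long form -> group by n-gram and sum impressions
long_form = (df.squery.apply(n_grams)
               .join(df["count"])
               .melt(id_vars="count", value_name="ngram")
               .dropna(subset=["ngram"]))
result = long_form.groupby("ngram")["count"].sum()
```

For example, `result["dog"]` comes out to 1350, matching the listing above.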
    

    Spiffy method

    This one is probably a bit more efficient, but it still materializes the dense n-gram matrix from CountVectorizer. It multiplies each column by the number of impressions and then sums over the rows (queries) to get a total number of impressions per n-gram. It gives nearly the same result as above; note that CountVectorizer's default tokenizer drops one-character tokens such as "a", and that a query containing a repeated n-gram counts it twice.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(ngram_range=(1, 5))
    ngrams = cv.fit_transform(df.squery)
    # repeat the impression counts so every n-gram column carries its query's weight
    mask = np.repeat(df['count'].values.reshape(-1, 1), repeats=len(cv.vocabulary_), axis=1)
    # feature names in column order
    index = [w for w, _ in sorted(cv.vocabulary_.items(), key=lambda x: x[1])]
    pd.Series(np.multiply(mask, ngrams.toarray()).sum(axis=0), name="counts", index=index)