
How do I compute weighted unigram/bigram/trigram frequencies with CountVectorizer, using weights from a column value instead of the raw counts?


My dataset contains a block of text as well as a column with a summarized count, and it looks like this:

    text,count
    this is my home,100
    where am i,10
    this is a piece of cake,2

Code that I found on the internet to build a unigram frequency list:

from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['text'], 20)

With a standard CountVectorizer, I would produce a unigram count that looks like:

this 2
is 2
my 1
home 1
where 1
am 1
i 1
a 1
piece 1
of 1
cake 1

I am hoping it can be weighted by the count column instead, since it's a summarized count, i.e.:

this 102
is 102
my 100
home 100
where 10
am 10
i 10
a 2
piece 2
of 2
cake 2

Is this possible?


Solution

  • What you can do is use the toarray method after the transform, so that you can multiply the resulting matrix by the count values:

    from sklearn.feature_extraction.text import CountVectorizer

    def get_top_n_words(corpus, count, n=None):  # add a parameter for the count values
        vec = CountVectorizer().fit(corpus)
        # multiply the dense array from transform by the count values, row-wise
        bag_of_words = vec.transform(corpus).toarray() * count.values[:, None]
        sum_words = bag_of_words.sum(axis=0)
        # indexing into sum_words changes slightly, but idx still applies
        words_freq = [(word, sum_words[idx]) for word, idx in vec.vocabulary_.items()]
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
        return words_freq[:n]

    common_words = get_top_n_words(df['text'], df['count'], 20)
    print(common_words)
    [('this', 102),
     ('is', 102),
     ('my', 100),
     ('home', 100),
     ('where', 10),
     ('am', 10),
     ('piece', 2),
     ('of', 2),
     ('cake', 2)]
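
A side note on memory: calling toarray densifies the whole document-term matrix, which can be costly for large corpora. A sketch of a variant (function name is my own) that scales each sparse row by its weight without densifying, using a sparse diagonal matrix on the left:

```python
import numpy as np
from scipy.sparse import diags
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words_sparse(corpus, count, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)            # sparse (n_docs, n_terms)
    # left-multiplying by a diagonal matrix scales row i by count[i],
    # keeping everything sparse
    weighted = diags(np.asarray(count)) @ bag_of_words
    sum_words = np.asarray(weighted.sum(axis=0)).ravel()
    words_freq = [(word, sum_words[idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n]
```

Note that with the default token_pattern, CountVectorizer drops single-character tokens, which is why "i" and "a" do not appear in the output above.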