python, r, machine-learning, nlp

Combining Unigram and Bigram in TF-IDF


I am working on a project where we are trying to produce a TF-IDF on a corpus of article titles divided into multiple clusters. Our goal is to have it contain the most significant unigrams AND bigrams at the same time for every cluster. Our plan is this: we first identify the most probable bigrams in our corpus. With that list, we then count the frequency of those bigrams in every cluster. What we want to do next, and that's where our problem lies, is to make sure we don't count the words in those bigrams twice. Let's say a popular bigram is 'climate change'. The bigram 'climate change' has a frequency of 6 in our corpus, but the word 'climate' has a frequency of 7 (it appears alone once) and the word 'change' has a frequency of 8 (it appears alone twice). We have to make sure our table with combined unigrams and bigrams doesn't look like this:

      n_gram          frequency
1: climate change         6
2:        climate         7
3:         change         8

It has to look like this (we subtract the frequency of 'climate change' from its corresponding 'climate' and 'change' unigram counts):

      n_gram          frequency
1: climate change         6
2:        climate         1
3:         change         2

The problem is, if we subtract the first- and second-word frequencies of every bigram from their corresponding unigrams, we sometimes get negative frequencies for unigrams. Our intuition is this: let's say a popular trigram is 'United States America'. Then we will have two frequent bigrams, namely 'United States' and 'States America'. So let's say we have this table at first (before any subtraction):

    n_gram          frequency
1:  United States        10
2: States America        10
3:         United        11
4:         States        12
5:        America        13

We would then have this table after subtracting the bigram frequencies:

       n_gram         frequency
1:  United States        10
2: States America        10
3:         United         1
4:         States        -8
5:        America         3
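The arithmetic behind that table can be reproduced in a few lines of Python (using the hypothetical counts above), which also shows where the negative value comes from:

```python
from collections import Counter

# Hypothetical counts from the tables above
unigrams = Counter({"United": 11, "States": 12, "America": 13})
bigrams = Counter({("United", "States"): 10, ("States", "America"): 10})

# Naive subtraction: each bigram's count is subtracted from both of its words
adjusted = unigrams.copy()
for (w1, w2), freq in bigrams.items():
    adjusted[w1] -= freq
    adjusted[w2] -= freq

# 'States' sits inside BOTH overlapping bigrams, so every occurrence of the
# trigram 'United States America' is subtracted from it twice: 12 - 10 - 10 = -8
print(adjusted)
```

The middle word of a frequent trigram belongs to two overlapping bigrams, so its count is reduced once per bigram, which is exactly why it can go negative.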

My questions are: is there an easy way around this that I don't see? And is there any other reason why we would get negative frequencies by using this method?


Solution

  • If you compute the bigrams first, then when you go to compute the unigram frequencies you can skip incrementing the frequency of any unigram instance that is part of a significant bigram. For example, if we have:

    ... Experts in the United States America believe that if we don't tackle climate change now, the climate will cause irreversible damage to America and our planet. In contrast, some people believe that climate change is a hoax invented by the United States America government ...

    our most frequent bigrams are:

      bi_gram         frequency
    1:  United States         2
    2: States America         2
    3: climate change         2
    

    When we compute our unigrams, we can ignore any instances of the unigrams that are part of any of the above bigrams. For example, we only increment America if it appears without States to its left, making our unigram frequency table (ignoring the other words):

         uni_gram         frequency
    1:    climate                 1
    2:     change                 1
    3:    America                 1
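A minimal Python sketch of this skip-counting scheme, applied to the passage above (the regex tokenization and the `count_combined` helper name are my own assumptions, not from the answer):

```python
import re
from collections import Counter

def count_combined(tokens, top_bigrams):
    """Count bigram frequencies, then unigram frequencies while skipping
    any token that occurs inside one of the significant bigrams."""
    bigram_counts = Counter(
        pair for pair in zip(tokens, tokens[1:]) if pair in top_bigrams
    )
    unigram_counts = Counter()
    for i, tok in enumerate(tokens):
        # A token is skipped if it is the left OR the right member of any
        # occurrence of a significant bigram
        part_of_bigram = (
            (i + 1 < len(tokens) and (tok, tokens[i + 1]) in top_bigrams)
            or (i > 0 and (tokens[i - 1], tok) in top_bigrams)
        )
        if not part_of_bigram:
            unigram_counts[tok] += 1
    return bigram_counts, unigram_counts

text = ("Experts in the United States America believe that if we don't tackle "
        "climate change now, the climate will cause irreversible damage to "
        "America and our planet. In contrast, some people believe that climate "
        "change is a hoax invented by the United States America government")
tokens = re.findall(r"[\w']+", text)
top_bigrams = {("United", "States"), ("States", "America"), ("climate", "change")}
bigrams, unigrams = count_combined(tokens, top_bigrams)
```

On this passage the bigrams 'United States', 'States America', and 'climate change' each count 2, while the unigrams come out as climate = 1 ("the climate will cause"), America = 1 ("damage to America"), and change = 0, matching the tables above; no subtraction is performed, so no count can go negative.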