machine-learning nlp

What is the difference between bigram and unigram text feature extraction?


I searched online for how to do bigram and unigram text feature extraction, but didn't find anything useful. Can someone tell me what the difference between them is?

For example, if I have the text "I have a lovely dog", what will happen if I use bigram feature extraction, and what will happen with unigram extraction?


Solution

  • We are trying to teach machines how to do natural language processing. Humans understand language easily, but machines cannot, so we teach them specific patterns of language. A single word has a meaning, but a group of words taken together often says more about the meaning of the text than each word does alone.

    An n-gram is a sequence of n adjacent words taken from the text (a window of n words), so when

    • n=1 it is a unigram

    • n=2 it is a bigram

    • n=3 it is a trigram, and so on

    Now suppose a machine tries to understand the meaning of the sentence "I have a lovely dog". It will split the sentence into chunks of a specific size.

    1. Unigrams: it considers the words one at a time, so each word is a gram.

      "I", "have", "a", "lovely", "dog"

    2. Bigrams: it considers two words at a time, so each pair of adjacent words is a gram.

      "I have", "have a", "a lovely", "lovely dog"

    In this way the machine splits a sentence into small groups of words, and each unigram or bigram then serves as one feature for understanding the meaning.
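
The splitting described above can be sketched in a few lines of Python. This is only a minimal illustration, not the way any particular library does it: the helper name `ngrams` is made up for this example, and it assumes words are separated by whitespace, with no lowercasing or punctuation handling. The last step counts each bigram, which is one simple way to turn the n-grams into features.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of text (hypothetical helper, whitespace tokenization)."""
    words = text.split()
    # Slide a window of n words over the sentence, one word at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I have a lovely dog"

unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)  # ['I', 'have', 'a', 'lovely', 'dog']
print(bigrams)   # ['I have', 'have a', 'a lovely', 'lovely dog']
print(trigrams)  # ['I have a', 'have a lovely', 'a lovely dog']

# Bag-of-n-grams features: map each bigram to how often it occurs.
bigram_features = Counter(bigrams)
print(bigram_features["lovely dog"])  # 1
```

A sentence of five words gives five unigrams but only four bigrams, because each bigram needs a neighbouring word to its right. Real feature extractors usually add steps this sketch leaves out, such as lowercasing and removing punctuation.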