Tags: python, python-3.x, nltk, wordnet, lemmatization

Does the lemmatization mechanism reduce the size of the corpus?


Dear Community Members,

During data pre-processing, after splitting the raw_data into tokens, I used the popular WordNet Lemmatizer to generate the lemmas. I am running experiments on a dataset that has 18953 tokens.

My question is: does the lemmatization process reduce the size of the corpus? I am a bit confused; any help in this regard is appreciated!
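For reference, here is a minimal sketch of the pipeline I described (simplified; raw_data here is just a hypothetical example string, and I use nltk's word_tokenize and WordNetLemmatizer):

    >>> from nltk import word_tokenize
    >>> from nltk.stem import WordNetLemmatizer

    >>> raw_data = "The cats were running over the fences."  # hypothetical example text
    >>> wnl = WordNetLemmatizer()
    >>> tokens = word_tokenize(raw_data.lower())
    >>> lemmas = [wnl.lemmatize(tok) for tok in tokens]  # default POS is noun
    >>> len(tokens) == len(lemmas)  # one lemma per token, so the token count is unchanged
    True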


Solution

  • Lemmatization converts each token (a.k.a. form) in the sentence into its lemma (a.k.a. type):

    >>> from nltk import word_tokenize
    >>> from pywsd.utils import lemmatize_sentence
    
    >>> text = ['This is a corpus with multiple sentences.', 'This was the second sentence running.', 'For some reasons, there is a need to second foo bar ran.']
    
    >>> lemmatize_sentence(text[0]) # Lemmatized sentence example.
    ['this', 'be', 'a', 'corpus', 'with', 'multiple', 'sentence', '.']
    >>> word_tokenize(text[0]) # Tokenized sentence example. 
    ['This', 'is', 'a', 'corpus', 'with', 'multiple', 'sentences', '.']
    >>> word_tokenize(text[0].lower()) # Lowercased and tokenized sentence example.
    ['this', 'is', 'a', 'corpus', 'with', 'multiple', 'sentences', '.']
    

    When we lemmatize a sentence, each token is replaced by its corresponding lemma, so the number of "words" stays the same whether we count forms or types:

    >>> num_tokens = sum([len(word_tokenize(sent.lower())) for sent in text])
    >>> num_lemmas = sum([len(lemmatize_sentence(sent)) for sent in text])
    >>> num_tokens, num_lemmas
    (29, 29)
    
    
    >>> [lemmatize_sentence(sent) for sent in text] # lemmatized sentences
    [['this', 'be', 'a', 'corpus', 'with', 'multiple', 'sentence', '.'], ['this', 'be', 'the', 'second', 'sentence', 'running', '.'], ['for', 'some', 'reason', ',', 'there', 'be', 'a', 'need', 'to', 'second', 'foo', 'bar', 'ran', '.']]
    
    >>> [word_tokenize(sent.lower()) for sent in text] # tokenized sentences
    [['this', 'is', 'a', 'corpus', 'with', 'multiple', 'sentences', '.'], ['this', 'was', 'the', 'second', 'sentence', 'running', '.'], ['for', 'some', 'reasons', ',', 'there', 'is', 'a', 'need', 'to', 'second', 'foo', 'bar', 'ran', '.']]
    

    The "compression" per-se would refer to the number of unique tokens represented in the whole corpus after you've lemmatized the sentences, e.g.

    >>> from itertools import chain

    >>> lemma_vocab = set(chain(*[lemmatize_sentence(sent) for sent in text]))
    >>> token_vocab = set(chain(*[word_tokenize(sent.lower()) for sent in text]))
    >>> len(lemma_vocab), len(token_vocab)
    (21, 23)
    
    >>> lemma_vocab
    {'the', 'this', 'to', 'reason', 'for', 'second', 'a', 'running', 'some', 'sentence', 'be', 'foo', 'ran', 'with', '.', 'need', 'multiple', 'bar', 'corpus', 'there', ','}
    >>> token_vocab
    {'the', 'this', 'to', 'for', 'sentences', 'a', 'second', 'running', 'some', 'is', 'sentence', 'foo', 'reasons', 'with', 'ran', '.', 'need', 'multiple', 'bar', 'corpus', 'there', 'was', ','}
    

    Note: Lemmatization is a pre-processing step, but it should not overwrite your original corpus with the lemmatized forms.
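
    For example, a minimal sketch (reusing the text list from above) that keeps the original tokenized corpus as-is and stores the lemmatized version as a separate copy:

    >>> from nltk import word_tokenize
    >>> from pywsd.utils import lemmatize_sentence

    >>> text = ['This is a corpus with multiple sentences.', 'This was the second sentence running.', 'For some reasons, there is a need to second foo bar ran.']

    >>> tokenized_corpus = [word_tokenize(sent.lower()) for sent in text]   # original forms, left untouched
    >>> lemmatized_corpus = [lemmatize_sentence(sent) for sent in text]     # derived copy for downstream use
    >>> len(tokenized_corpus) == len(lemmatized_corpus)
    True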