scikit-learn nltk stop-words lemmatization countvectorizer

Lemmatization on CountVectorizer doesn't remove Stopwords


I'm trying to add lemmatization to CountVectorizer from scikit-learn, as follows:

import nltk
from pattern.es import lemma
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer

class LemmaTokenizer(object):
    def __call__(self, text):
        return [lemma(t) for t in word_tokenize(text)]

vectorizer = CountVectorizer(stop_words=stopwords.words('spanish'),tokenizer=LemmaTokenizer())

sentence = ["EVOLUCIÓN de los sucesos y la EXPANSIÓN, ellos juegan y yo les dije lo que hago","hola, qué tal vas?"]

vectorizer.fit_transform(sentence)

This is the output:

[u',', u'?', u'car', u'decir', u'der', u'evoluci\xf3n', u'expansi\xf3n', u'hacer', u'holar', u'ir', u'jugar', u'lar', u'ler', u'sucesos', u'tal', u'yar']

UPDATED

These are the stop words that still appear, after being lemmatized:

u'lar', u'ler', u'der'

It lemmatizes all words but doesn't remove the stop words. Any ideas?


Solution

  • That's because lemmatization is done before stop word removal, so the lemmatized stop words are no longer found in the stop word set provided by stopwords.words('spanish').

    For the complete working order of CountVectorizer, please refer to my other answer here. It's about TfidfVectorizer, but the order is the same. In that answer, step 3 is the lemmatization and step 4 is stop word removal.
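To make the ordering concrete, here is a minimal sketch in plain Python. It does not call pattern.es, NLTK, or scikit-learn: `TOY_STOP_WORDS`, `TOY_LEMMAS`, `toy_lemma`, and `analyze` are hypothetical stand-ins, and the lemma table is only an assumption chosen to resemble the mangled forms in the question's output.

```python
# Toy stand-ins only; these are NOT the real pattern.es / NLTK / scikit-learn objects.
TOY_STOP_WORDS = {"la", "les", "de"}
# Assumed lemma table that imitates forms such as 'lar', 'ler', 'der'.
TOY_LEMMAS = {"la": "lar", "les": "ler", "de": "der", "juegan": "jugar"}


def toy_lemma(token):
    """Return the toy lemma, or the token unchanged if it is not in the table."""
    return TOY_LEMMAS.get(token, token)


def analyze(text, stop_words):
    tokens = text.lower().split()                       # 1. preprocess and tokenize
    lemmas = [toy_lemma(t) for t in tokens]             # 2. lemmatize (the custom tokenizer's job)
    return [t for t in lemmas if t not in stop_words]   # 3. stop word filter runs last


result = analyze("ellos juegan de la les", TOY_STOP_WORDS)
print(result)
```

In this toy model the filter in step 3 compares 'der', 'lar', and 'ler' against a list that only contains 'de', 'la', and 'les', so none of them is removed.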

    So now to remove the stopwords, you have two options:

    1) Lemmatize the stop word set itself, then pass it to the stop_words parameter of CountVectorizer.

    my_stop_words = [lemma(t) for t in stopwords.words('spanish')]
    vectorizer = CountVectorizer(stop_words=my_stop_words, 
                                 tokenizer=LemmaTokenizer())
    

    2) Include the stop word removal in the LemmaTokenizer itself.

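    class LemmaTokenizer(object):
        def __init__(self):
            # Build the stop word set once instead of reloading the list for every token.
            self.stop_words = set(stopwords.words('spanish'))
        def __call__(self, text):
            return [lemma(t) for t in word_tokenize(text) if t not in self.stop_words]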
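The two options can be compared side by side in a small plain-Python sketch. Everything here is a hypothetical stand-in (`TOY_STOP_WORDS`, `TOY_LEMMAS`, `toy_lemma`, `filter_after`, `filter_before`); the lemma table is an assumption for illustration and not the behaviour of pattern.es.

```python
# Toy stand-ins only; the lemma table is assumed, not taken from pattern.es.
TOY_STOP_WORDS = {"la", "les", "de"}
TOY_LEMMAS = {"la": "lar", "les": "ler", "de": "der", "juegan": "jugar"}


def toy_lemma(token):
    """Return the toy lemma, or the token unchanged if it is not in the table."""
    return TOY_LEMMAS.get(token, token)


def filter_after(tokens):
    """Option 1: lemmatize everything, then drop tokens found in the lemmatized stop list."""
    lemmatized_stops = {toy_lemma(s) for s in TOY_STOP_WORDS}
    return [lem for lem in map(toy_lemma, tokens) if lem not in lemmatized_stops]


def filter_before(tokens):
    """Option 2: drop stop words by surface form, then lemmatize what is left."""
    return [toy_lemma(t) for t in tokens if t not in TOY_STOP_WORDS]


tokens = "ellos juegan de la les".split()
option_1 = filter_after(tokens)
option_2 = filter_before(tokens)
print(option_1, option_2)
```

On this input both approaches keep the same tokens. They can diverge in general: option 1 also removes any non-stop word whose lemma happens to equal a lemmatized stop word, while option 2 only removes exact surface matches.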
    

    Try these and comment if they don't work.