
How to apply a custom stemmer before passing the training corpus to TfidfVectorizer in sklearn?


Here is my code. I have a sentence that I want to tokenize and stem before passing it to TfidfVectorizer, so that I finally get a tf-idf representation of the sentence:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer_ita = SnowballStemmer("italian")

def tokenizer_stemmer_ita(text):
    return [stemmer_ita.stem(word) for word in text.split()]

def sentence_tokenizer_stemmer(text):
    return " ".join([stemmer_ita.stem(word) for word in text.split()])

X_train = ['il libro è sul tavolo']

X_train = [sentence_tokenizer_stemmer(text) for text in X_train]

tfidf = TfidfVectorizer(preprocessor=None, tokenizer=None, use_idf=True, stop_words=None, ngram_range=(1,2))
X_train = tfidf.fit_transform(X_train)

# let's see the features
print(tfidf.get_feature_names())

I get as output:

['il', 'il libr', 'libr', 'libr sul', 'sul', 'sul tavol', 'tavol']

if I change the parameter

tokenizer=None

to:

tokenizer=tokenizer_stemmer_ita

and comment out this line:

X_train = [sentence_tokenizer_stemmer(text) for text in X_train]

I expect to get the same result, but the result is different:

['il', 'il libr', 'libr', 'libr è', 'sul', 'sul tavol', 'tavol', 'è', 'è sul']

Why? Am I using the external stemmer correctly? At the very least, it seems that stopword-like tokens ("è") are being removed in the first run, even though stop_words=None.

[edit] As suggested by Vivek, the problem seems to be the default token pattern, which is applied whenever tokenizer=None. So if I import re and add these two lines at the beginning of tokenizer_stemmer_ita:

token_pattern = re.compile(r'(?u)\b\w\w+\b')
text = " ".join(token_pattern.findall(text))

I should get the correct behaviour, and in fact I get it for the above simple example, but for a different example:

X_train = ['0.05%.\n\nVedete?']

I don't; the two outputs are different:

['05', '05 ved', 'ved']

and

['05', '05 vedete', 'vedete']

Why? In this case the question mark seems to be the problem; without it the outputs are identical.

[edit2] It seems I have to stem first and then apply the regex; in that case the two outputs are identical.
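
In other words, a sketch of the final tokenizer (reusing token_pattern from above):

def tokenizer_stemmer_ita(text):
    # stem each whitespace-separated word first, then apply the default pattern
    stemmed = " ".join(stemmer_ita.stem(word) for word in text.split())
    return token_pattern.findall(stemmed)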


Solution

  • That's because of the default token pattern, token_pattern, used by TfidfVectorizer:

    token_pattern : string

    Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

    So the character è is not selected.

    import re
    token_pattern = re.compile(r'(?u)\b\w\w+\b')
    print(token_pattern.findall('il libro è sul tavolo'))
    
    # Output
    # ['il', 'libro', 'sul', 'tavolo']
    

    This default token_pattern is used when tokenizer is None, as you are experiencing.
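
    A quick way to see the difference (a sketch; str.split stands in for any custom tokenizer):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # tokenizer=None: the default token_pattern applies and drops 1-char tokens like 'è'
    v_default = TfidfVectorizer(ngram_range=(1, 2))
    v_default.fit(['il libro è sul tavolo'])
    print(v_default.get_feature_names())  # no 'è' among the features

    # custom tokenizer: token_pattern is ignored, so 'è' survives
    v_custom = TfidfVectorizer(tokenizer=str.split, ngram_range=(1, 2))
    v_custom.fit(['il libro è sul tavolo'])
    print(v_custom.get_feature_names())   # includes 'è' and 'è sul'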