Tags: python-3.x, scikit-learn, nlp, tf-idf, tfidfvectorizer

Sklearn tf-idf TfidfVectorizer fails to capture one-letter words


A particular instance is "Queens Stop 'N' Swap". After transforming, I only got three features: ['Queens', 'Stop', 'Swap']. The 'N' has been ignored. How can I capture the 'N'? All the parameters are default settings in my code.

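from sklearn.feature_extraction.text import TfidfVectorizer

### Create the vectorizer with default settings
tfidf_vec = TfidfVectorizer()

### Transform the text into tf-idf vectors (title_text is an iterable of strings)
text_tfidf = tfidf_vec.fit_transform(title_text)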

Solution

  • You're not getting 'N' as a token because the default token_pattern only matches tokens of two or more word characters, so one-letter words are dropped:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    texts = ["Queens Stop 'N' Swap"]
    # This is the default token_pattern: a token needs at least two word characters
    tfidf = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+\b')
    tfidf.fit(texts)
    tfidf.vocabulary_
    {'queens': 0, 'stop': 1, 'swap': 2}
    

    To capture one-letter tokens with capitalization preserved, change it like this:

    # \w+ also accepts one-character tokens; lowercase=False keeps the original case
    tfidf = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b', lowercase=False)
    tfidf.fit(texts)
    tfidf.vocabulary_
    {'Queens': 1, 'Stop': 2, 'N': 0, 'Swap': 3}
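
To see why the two patterns behave differently, it can help to try them directly with Python's re module, without scikit-learn. The sketch below only approximates the vectorizer's tokenization step (optional lowercasing followed by a regex findall). The names tokenize, DEFAULT_PATTERN and SINGLE_CHAR_PATTERN are made up for this illustration and are not part of scikit-learn.

```python
import re

# scikit-learn's default token_pattern: two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokenize(text, pattern, lowercase=True):
    """Hypothetical helper: lowercase (optionally), then extract regex matches."""
    if lowercase:
        text = text.lower()
    return re.findall(pattern, text)


title = "Queens Stop 'N' Swap"

default_tokens = tokenize(title, DEFAULT_PATTERN)
single_tokens = tokenize(title, SINGLE_CHAR_PATTERN, lowercase=False)

print(default_tokens)  # ['queens', 'stop', 'swap']
print(single_tokens)   # ['Queens', 'Stop', 'N', 'Swap']
```

Keep in mind that relaxing the pattern applies to the whole corpus, so other one-character tokens such as "a" or single digits will also end up in the vocabulary.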