Tags: python, scikit-learn, nlp, tf-idf

sklearn TfidfVectorizer custom ngrams without characters from regex pattern


I would like to perform custom ngram vectorization using sklearn TfidfVectorizer. The generated ngrams should not contain any character from a given regex pattern. Unfortunately, the custom tokenizer function is ignored entirely when analyzer='char' (ngram mode). See the following example:

import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

pattern = re.compile(r'[\.-]')  # split on '.' and on '-'

def tokenize(text):
    return pattern.split(text)

corpus = np.array(['abc.xyz', 'zzz-m.j'])

# word vectorization
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, analyzer='word', stop_words='english')
tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.vocabulary_)
# Output -> {'abc': 0, 'xyz': 3, 'zzz': 4, 'm': 2, 'j': 1}
# This is ok!

# ngram vectorization
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, analyzer='char', ngram_range=(2, 2))
tfidf_vectorizer.fit_transform(corpus)
print(tfidf_vectorizer.vocabulary_)
# Output -> {'ab': 3, 'bc': 4, 'c.': 5, '.x': 2, 'xy': 7, 'yz': 8, 'zz': 10, 'z-': 9, '-m': 0, 'm.': 6, '.j': 1}
# This is not ok! I don't want ngrams to include the '.' and '-' chars used for tokenization

What is the best way to do it?


Solution

  • I've written the following solution using nltk:

    import re
    import numpy as np
    from nltk.util import ngrams
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    pattern = re.compile(r'[\.-]')  # split on '.' and on '-'
    
    corpus = np.array(['abc.xyz', 'zzz-m.j'])
    
    
    def analyzer(text):
        text = text.lower()
        tokens = pattern.split(text)    
        return [''.join(ngram) for token in tokens for ngram in ngrams(token, 2)]
    
    tfidf_vectorizer = TfidfVectorizer(analyzer=analyzer)
    tfidf_vectorizer.fit_transform(corpus)
    print(tfidf_vectorizer.vocabulary_)
    
    # Output -> {'ab': 0, 'bc': 1, 'xy': 2, 'yz': 3, 'zz': 4}
    
    

    Not sure if this is the best way to go though.
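    A variant of the same custom-analyzer idea that drops the nltk dependency: since the n-grams here are plain character bigrams, each token can be sliced directly. This is a sketch under the same setup as the question (same `pattern` and `corpus`); only the bigram generation changes.

    ```python
    import re
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    pattern = re.compile(r'[\.-]')  # split on '.' and on '-'

    def analyzer(text):
        # Split on the separator chars, then take character bigrams
        # inside each token, so no bigram ever spans a '.' or '-'.
        tokens = pattern.split(text.lower())
        return [tok[i:i + 2] for tok in tokens for i in range(len(tok) - 1)]

    corpus = np.array(['abc.xyz', 'zzz-m.j'])

    tfidf_vectorizer = TfidfVectorizer(analyzer=analyzer)
    tfidf_vectorizer.fit_transform(corpus)
    print(tfidf_vectorizer.vocabulary_)
    # Output -> {'ab': 0, 'bc': 1, 'xy': 2, 'yz': 3, 'zz': 4}
    ```

    The vocabulary is identical to the nltk version; for other n-gram sizes the slice width (and the range bound) would need to be parameterized, which is where `nltk.util.ngrams` starts to pay for itself.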