python scikit-learn tfidfvectorizer

Analyzer ignoring certain words when used in sklearn TfidfVectorizer


Here is my code:

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=4):
    string = re.sub(r'[,-./]|\sBD',r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    R = [''.join(ngram) for ngram in ngrams]
    if len(R) == 0:
        return string
    else:
        return R

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']

vectorizer = TfidfVectorizer(min_df = 0, token_pattern='(?u)\\b\\w+\\b', analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(L)

print(vectorizer.vocabulary_)

The output of vocabulary is {'a': 0}.

I am confused about where "aa" and "aaa" went. If you check my ngrams function, it returns the string itself when its length is less than the parameter n (which is 4 in the code above).

The token regex is also written so that it accepts single-character tokens.


Solution

  • This is a theory.

    I believe TfidfVectorizer expects the analyzer function to return a sequence of tokens. Notice the inputs vs outputs of your ngrams function:

    'a'  -> 'a'
    'aa' -> 'aa'
    'aaa' -> 'aaa'
    'aaaa' -> ['aaaa']
    'aaaaa' -> ['aaaa','aaaa']
    

    A string is itself a sequence (of characters), so in the first 3 cases you are returning a sequence that gets iterated one character at a time, and every token seen is the single letter 'a'.

    If my theory is correct, you need to replace

            return string
    

    with

            return [string]
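
To check the theory without depending on the vectorizer's internals, here is a minimal sketch. It assumes the vectorizer simply iterates over whatever the analyzer returns and treats each item as one token. The names ngrams_original, ngrams_fixed and collect_tokens are hypothetical helpers made up for this illustration; collect_tokens is only a stand-in for that iteration, not sklearn's actual code.

```python
import re
from collections import Counter


def ngrams_original(string, n=4):
    """Analyzer from the question: returns a bare string for short inputs."""
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = zip(*[string[i:] for i in range(n)])
    result = [''.join(gram) for gram in grams]
    if len(result) == 0:
        return string
    return result


def ngrams_fixed(string, n=4):
    """Same analyzer, but short inputs are wrapped in a one-element list."""
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = zip(*[string[i:] for i in range(n)])
    result = [''.join(gram) for gram in grams]
    if len(result) == 0:
        return [string]
    return result


def collect_tokens(analyzer, docs):
    """Hypothetical stand-in for the vectorizer: count every item
    produced by iterating over the analyzer's return value."""
    counts = Counter()
    for doc in docs:
        for token in analyzer(doc):
            counts[token] += 1
    return counts


L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']

# Iterating over the returned string 'aaa' yields 'a', 'a', 'a'.
print(sorted(collect_tokens(ngrams_original, L)))  # ['a']

# Iterating over the returned list ['aaa'] yields the whole token 'aaa'.
print(sorted(collect_tokens(ngrams_fixed, L)))     # ['a', 'aa', 'aaa']
```

With the original function the only token ever seen is 'a', which matches the {'a': 0} vocabulary in the question. With the list-returning version the three distinct tokens survive, so passing it as analyzer should give a vocabulary containing 'a', 'aa' and 'aaa'.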