Here is my code:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def ngrams(string, n=4):
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    R = [''.join(ngram) for ngram in ngrams]
    if len(R) == 0:
        return string
    else:
        return R

L = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']
vectorizer = TfidfVectorizer(min_df = 0, token_pattern='(?u)\\b\\w+\\b', analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(L)
print(vectorizer.vocabulary_)
The vocabulary output is {'a': 0}.
I am confused about where "aa" and "aaa" went. As you can see in my ngrams function, I return the string itself if its length is less than the parameter n (which is 4 in the code above).
The token regex is also written to accept single characters.
This is a theory.
I believe TfidfVectorizer expects the analyzer function to return a sequence of tokens. Notice the inputs vs. outputs of your ngrams function:
'a' -> 'a'
'aa' -> 'aa'
'aaa' -> 'aaa'
'aaaa' -> ['aaaa']
'aaaaa' -> ['aaaa','aaaa']
A string is itself a sequence of characters, so in the first 3 cases you are returning a sequence that consists of repeats of the single letter 'a'.
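To make that concrete, here is a minimal sketch that reproduces the effect without scikit-learn. The helper collect_tokens is hypothetical: it is only a simplified stand-in for how I assume the vectorizer gathers tokens (by iterating over whatever the analyzer returns), not the library's actual code. ngrams_original is your function, renamed for the example.

```python
import re

def ngrams_original(string, n=4):
    # Same logic as the function in the question
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = [''.join(g) for g in zip(*[string[i:] for i in range(n)])]
    return string if len(grams) == 0 else grams

def collect_tokens(docs, analyzer):
    # Hypothetical stand-in for the vectorizer: iterate over the
    # analyzer's return value and keep every distinct item as a token
    return sorted({token for doc in docs for token in analyzer(doc)})

docs = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']

# Iterating over the returned string yields single characters
print(list(ngrams_original('aaa')))           # ['a', 'a', 'a']
print(collect_tokens(docs, ngrams_original))  # ['a']
```

Every short document collapses to the single character 'a', which matches the vocabulary you are seeing.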
If my theory is correct, you need to replace
return string
with
return [string]
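For reference, a sketch of the whole function with that change applied. The name ngrams_fixed is mine, and the set comprehension at the end is again only an assumed approximation of how the vectorizer collects tokens, so treat the result as what I would expect rather than a guarantee.

```python
import re

def ngrams_fixed(string, n=4):
    string = re.sub(r'[,-./]|\sBD', r'', string)
    grams = [''.join(g) for g in zip(*[string[i:] for i in range(n)])]
    # Wrap a too-short string in a list so it counts as one whole token
    return grams if grams else [string]

docs = ['a', 'aa', 'aaa', 'a', 'aa', 'aaa']

# Approximate the tokens a vectorizer would see for these documents
tokens = sorted({t for d in docs for t in ngrams_fixed(d)})
print(tokens)                 # ['a', 'aa', 'aaa']
print(ngrams_fixed('aaaaa'))  # ['aaaa', 'aaaa']
```

Passing ngrams_fixed as the analyzer argument should then give a vocabulary containing 'a', 'aa' and 'aaa'.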