python machine-learning scikit-learn nlp n-gram

Getting n-gram suffixes using sklearn CountVectorizer

I am trying to get the 1-, 2-, and 3-gram suffixes of a word and use them as features in my model.

Example,

word = "Apple"
 1 gram suffix = 'e'
 2 gram suffix = 'le'
 3 gram suffix = 'ple'

I have used CountVectorizer in sklearn with ngram_range=(1,3), but that gives all the n-grams. I only need the n-gram suffixes.

How can I do that?

Also, I'm new to NLP and have no clue how to use these n-grams as features in my ML model. How can I convert these string n-gram features into some numeric representation that my model can use?

Can someone please help me out?

Solution

You can pass a custom analyzer to CountVectorizer to control how features are extracted from each input. In your case, a simple lambda function that yields the suffixes of a word will suffice:

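from sklearn.feature_extraction.text import CountVectorizer

words = ["Orange", "Apple", "I"]
n = 3  # longest suffix to extract

# The analyzer yields the last 1..n characters of each word (fewer if the
# word is shorter than n). A callable analyzer replaces the default
# preprocessing, so the input is not lowercased.
vect = CountVectorizer(analyzer=lambda x: (x[-i:] for i in range(1, min(n, len(x)) + 1)))
mat = vect.fit_transform(words).toarray()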
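
A lambda is fine for a quick experiment, but a vectorizer that holds a lambda cannot be saved with the standard pickle module. A minimal sketch of the same analyzer written as a named function (the names suffix_analyzer, vect_named and mat_named are hypothetical, chosen only for this illustration), which also makes it easy to see what is extracted from each word:

```python
from sklearn.feature_extraction.text import CountVectorizer


def suffix_analyzer(word, n=3):
    """Return the suffixes of length 1..n (hypothetical helper name)."""
    longest = min(n, len(word))  # short words yield fewer suffixes
    return [word[-k:] for k in range(1, longest + 1)]


# Show the suffixes extracted from each sample word.
for w in ["Orange", "Apple", "I"]:
    print(w, suffix_analyzer(w))

# The function can be passed wherever the lambda was used.
vect_named = CountVectorizer(analyzer=suffix_analyzer)
mat_named = vect_named.fit_transform(["Orange", "Apple", "I"]).toarray()
```

The default n=3 is an assumption matching the question; CountVectorizer calls the analyzer with a single argument, so a different length would need a wrapper such as functools.partial.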

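Now if we construct a dataframe from the resulting matrix mat, we get one row per word and one column per suffix (get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use it instead of get_feature_names_out()):

import pandas as pd

pd.DataFrame(mat, columns=vect.get_feature_names_out())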

   I  e  ge  le  nge  ple
0  0  1   1   0    1    0
1  0  1   0   1    0    1
2  1  0   0   0    0    0
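
As for the second part of the question: the count matrix is already a numeric representation (each column counts how often a given suffix occurs in a word), so it can be fed to an estimator directly. A minimal sketch, assuming a made-up word-labelling task; the training words, the labels, the suffixes helper and the choice of LogisticRegression are all illustrative assumptions and not part of the question:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

N = 3  # longest suffix to extract


def suffixes(word, n=N):
    """Return the last 1..n characters of word (fewer for short words)."""
    return [word[-i:] for i in range(1, min(n, len(word)) + 1)]


# Hypothetical toy data: the labels are invented purely for illustration.
train_words = ["Orange", "Apple", "walking", "running", "Grape", "jumping"]
train_labels = ["noun", "noun", "verb", "verb", "noun", "verb"]

# The vectorizer turns each word into suffix counts; the classifier
# then learns from that numeric matrix.
model = make_pipeline(CountVectorizer(analyzer=suffixes), LogisticRegression())
model.fit(train_words, train_labels)

# Unseen words are mapped onto the suffix vocabulary learned during fit;
# suffixes that were never seen in training are simply ignored.
predictions = model.predict(["Maple", "singing"])
print(predictions)
```

Wrapping both steps in a pipeline keeps the suffix vocabulary learned during fit consistent at prediction time. Any estimator that accepts sparse input could stand in for the classifier here.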