I am trying to extract the 1-, 2-, and 3-gram suffixes of a word and use them as features in my model.
For example:
word = "Apple"
1 gram suffix = 'e'
2 gram suffix = 'le'
3 gram suffix = 'ple'
I have used CountVectorizer in sklearn with ngram_range=(1,3), but that gives all the n-grams. I just need the n-gram suffixes.
How can I do that?
Also, I'm new to NLP and have no clue how to use these n-grams as features in my ML model. How can I convert these "string" n-gram features to some sort of numeric representation so that I can use them in my model?
Can someone please help me out?
You can define a custom analyzer to control how features are extracted from the input. In your case, a simple lambda function that returns the suffixes of a word will suffice:
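import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

words = ["Orange", "Apple", "I"]
n = 3
# Custom analyzer: yield the last 1..n characters of each word (fewer for short words)
vect = CountVectorizer(analyzer=lambda x: (x[-i-1:] for i in range(min(n, len(x)))))
mat = vect.fit_transform(words).toarray()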
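Building a DataFrame from the resulting matrix shows one column per suffix and one row per input word, which is already the numeric representation you can feed to a model:
# get_feature_names() was removed in scikit-learn 1.2; on older versions use it instead
pd.DataFrame(mat, columns=vect.get_feature_names_out())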
   I  e  ge  le  nge  ple
0  0  1   1   0    1    0
1  0  1   0   1    0    1
2  1  0   0   0    0    0
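To make it clearer what the vectorizer is doing with these suffixes, here is a minimal pure-Python sketch of the same idea: extract the suffixes, assign each distinct suffix a column, and count. The helper names (suffixes, fit_vocabulary, transform) are made up for illustration and are not part of scikit-learn; in practice you would keep using CountVectorizer as above.

```python
def suffixes(word, n=3):
    """Return the 1..n character suffixes of word (fewer if the word is short)."""
    return [word[-k:] for k in range(1, min(n, len(word)) + 1)]


def fit_vocabulary(words, n=3):
    """Map each distinct suffix seen in words to a column index, in sorted order."""
    vocab = sorted({s for w in words for s in suffixes(w, n)})
    return {s: i for i, s in enumerate(vocab)}


def transform(words, vocab, n=3):
    """Build one count row per word; suffixes missing from vocab are ignored."""
    rows = []
    for w in words:
        row = [0] * len(vocab)
        for s in suffixes(w, n):
            if s in vocab:
                row[vocab[s]] += 1
        rows.append(row)
    return rows


words = ["Orange", "Apple", "I"]
vocab = fit_vocabulary(words)   # columns: I, e, ge, le, nge, ple
X = transform(words, vocab)     # same rows as the table above
```

Each row of X is a plain numeric vector, so it can be passed to any estimator that accepts numeric arrays. Note that a suffix that was not seen when the vocabulary was built is simply dropped when transforming new words.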