python, machine-learning, scikit-learn, nlp, n-gram

Getting n gram suffix using sklearn count vectorizer


I am trying to get the 1-, 2-, and 3-gram suffixes of a word and use them as features in my model.

Example,

word = "Apple"
 1 gram suffix = 'e'
 2 gram suffix = 'le'
 3 gram suffix = 'ple'

I have used CountVectorizer in sklearn with ngram_range=(1,3) but that gives all the n grams. I just need the n gram suffixes.

How can I do that?

Also, I'm new to NLP and have no clue how to use these n-grams as features in my ML model. How can I convert these "string" n-gram features to some sort of numeric representation so that I can use them in my model?

Can someone please help me out?


Solution

  • You can pass a custom analyzer to control how the features are extracted from the input. For your case, a simple lambda function that returns the suffixes of a word will suffice:

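    from sklearn.feature_extraction.text import CountVectorizer

    word = ["Orange", "Apple", "I"]
    n = 3
    # A callable analyzer receives each raw input string (no tokenizing or
    # lowercasing is applied) and yields its suffixes of length 1..n.
    vect = CountVectorizer(analyzer=lambda x: (x[-i-1:] for i in range(min(n, len(x)))))
    mat = vect.fit_transform(word).toarray()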
    

    Now if we construct a dataframe from the resulting vectorized matrix:

    import pandas as pd
    pd.DataFrame(mat, columns=vect.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
    
       I  e  ge  le  nge  ple
    0  0  1   1   0    1    0
    1  0  1   0   1    0    1
    2  1  0   0   0    0    0
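
On the second part of the question: the matrix above already is the numeric representation, since each column is one suffix and each cell is a count. The following is a minimal pure-Python sketch of that mapping, meant only to illustrate what the vectorizer computes for this example. The name `suffix_ngrams` is hypothetical (not part of scikit-learn), and the sorted-vocabulary layout is an assumption chosen to match the table above.

```python
def suffix_ngrams(word, max_n=3):
    """Return the suffixes of `word` of length 1..max_n (fewer for short words)."""
    return [word[-k:] for k in range(1, min(max_n, len(word)) + 1)]


words = ["Orange", "Apple", "I"]

# Vocabulary: every distinct suffix, sorted, mapped to a column index.
vocab = sorted({s for w in words for s in suffix_ngrams(w)})
index = {s: j for j, s in enumerate(vocab)}

# One row per word, one column per suffix; each cell counts occurrences.
rows = []
for w in words:
    row = [0] * len(vocab)
    for s in suffix_ngrams(w):
        row[index[s]] += 1
    rows.append(row)

print(vocab)
for r in rows:
    print(r)
```

A named function like this can also be passed as `analyzer=suffix_ngrams` in place of the lambda, and the resulting count matrix can then be given to an estimator's `fit` like any other numeric feature matrix.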