Tags: python, machine-learning, tf-idf, n-gram, cosine-similarity

Using well-known Python packages to implement N-Gram, TF-IDF and Cosine Similarity


I'm trying to implement a similarity function using

  • N-Grams
  • TF-IDF
  • Cosine Similarity


Concept:

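def predict(words, word):
    # create_ngrams and tfidf_tokenizer are placeholders for the library calls I'm looking for
    words_ngrams = create_ngrams(words, range=(2, 4))
    word_ngrams = create_ngrams(word, range=(2, 4))

    # Fit TF-IDF on the n-grams of the known words, then vectorize both sides
    words_tokenizer = tfidf_tokenizer(words_ngrams)
    words_vecs = words_tokenizer.transform(words_ngrams)
    word_vec = words_tokenizer.transform(word_ngrams)

    return cosine_similarity(word_vec, words_vecs)

words = [...]
word = '...'
similarity = predict(words, word)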

I searched the web for a simple and safe implementation, but I couldn't find one that uses well-known Python packages such as sklearn, nltk or scipy. Most of them use "self-made" calculations.

I'm trying to avoid coding every step by hand, and I'm guessing there is an easy way to build this whole pipeline.

Any help (and code) would be appreciated. Thanks :)


Solution

  • Eventually I figured it out.

    For anyone who needs a solution to this question, here's a function I wrote that takes care of it:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def add_prediction_by_ngram_tfidf_cosine(from_column_name, ngram_range=(2, 4)):
        '''
        N-Gram & TF-IDF & Cosine Similarity
        Using n-gram on 'from column' with TF-IDF to predict the 'to column'.
        Adding to the df a 'cosine_similarity' feature with the numeric result.
        '''
        # Relies on a module-level DataFrame `df` that has a 'FromColumn' column.
        global df

        # Character n-grams (2 to 4 characters by default), weighted by TF-IDF.
        vectorizer = TfidfVectorizer(analyzer='char', ngram_range=ngram_range)
        # Learn the vocabulary and IDF weights, and vectorize every row in one call.
        row_vecs = vectorizer.fit_transform(df.FromColumn)

        # Despite its name, from_column_name is the query string each row is compared against.
        vec_word = vectorizer.transform([from_column_name])

        # cosine_similarity returns an (n_rows, 1) array; flatten it into a column.
        df['cosine_similarity'] = cosine_similarity(row_vecs, vec_word).ravel()
    

    Note: this is not production-ready.
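
For readers who want something closer to the predict(words, word) shape from the question, the same idea can be sketched without a global DataFrame. This is only a minimal sketch, not the author's code: the function body, the default ngram_range and the sample words are assumptions made for illustration. It uses scikit-learn's TfidfVectorizer with analyzer='char', so the n-gram step and the TF-IDF step happen in one object.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def predict(words, word, ngram_range=(2, 4)):
    """Return the cosine similarity between `word` and each entry of `words`."""
    # Character n-grams (2 to 4 characters by default), weighted by TF-IDF.
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=ngram_range)
    # Learn the n-gram vocabulary and IDF weights from the candidates,
    # and vectorize all of them in a single call.
    words_matrix = vectorizer.fit_transform(words)
    # Project the query onto the same vocabulary; unseen n-grams are ignored.
    word_vec = vectorizer.transform([word])
    # Result has shape (len(words), 1); flatten to one score per candidate.
    return cosine_similarity(words_matrix, word_vec).ravel()


# Illustrative data only (not from the original post).
words = ['apple', 'apples', 'banana']
scores = predict(words, 'apple')
best_match = words[scores.argmax()]
```

Here scores holds one value per entry in words, in the same order, so argmax picks the closest candidate. Fitting on the candidate list means the IDF weights depend on that list, which may or may not be what you want for a very small vocabulary.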