pandas, dataframe, nlp, text-classification, tf-idf

Euclidean distance from a word to a sentence after applying TfidfVectorizer


I have a DataFrame with 1000 rows of text.

I applied TfidfVectorizer to it.

Now I want to create a new column that gives the distance from each sentence to a word of my choice, let's say the word "king": df['king'].

I thought about taking, in each sentence, the 5 words closest to the word "king" and averaging them.

I would be glad to know how to do that, or to hear about another method.


Solution

  • I am not convinced that Euclidean distance is the best measure here. I would look at cosine similarity scores instead:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
    
    data = {
        'text': [
            "The king sat on the throne with wisdom.",
            "A queen ruled the kingdom alongside the king.",
            "Knights were loyal to their king.",
            "The empire prospered under the rule of a wise monarch."
        ]
    }
    df = pd.DataFrame(data)
    
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(df['text'])
    
    # transform() never raises for an unknown word; it returns an all-zero vector,
    # so check the fitted vocabulary explicitly.
    if "king" not in tfidf.vocabulary_:
        print("The word 'king' is not in the vocabulary.")
    king_vector = tfidf.transform(["king"]).toarray()
    
    similarities = cosine_similarity(tfidf_matrix, king_vector).flatten()
    
    feature_names = np.array(tfidf.get_feature_names_out())
    
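    def get_top_n_words(row_vector, top_n=5):
        # Indices of the top_n highest TF-IDF weights in this sentence
        indices = row_vector.argsort()[::-1][:top_n]
        return feature_names[indices]
    
    # For each sentence, average the similarity to "king" of its top-weighted words.
    # A single word maps to a one-hot vector, so each term is 1.0 for "king" itself
    # and 0.0 for any other word.
    averages = []
    for i in range(tfidf_matrix.shape[0]):
        sentence_vector = tfidf_matrix[i].toarray().flatten()
        top_words = get_top_n_words(sentence_vector)
        top_similarities = [
            cosine_similarity(tfidf.transform([word]), king_vector).flatten()[0]
            for word in top_words
        ]
        averages.append(np.mean(top_similarities))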
    
    df['king_similarity'] = similarities
    df['avg_closest_similarity'] = averages
    
    print(df)
    

    which would give you

                                                    text  king_similarity  \
    0            The king sat on the throne with wisdom.         0.240614   
    1      A queen ruled the kingdom alongside the king.         0.259779   
    2                  Knights were loyal to their king.         0.274487   
    3  The empire prospered under the rule of a wise ...         0.000000   
    
       avg_closest_similarity  
    0                     0.0  
    1                     0.0  
    2                     0.0  
    3                     0.0  
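
    As a cross-check on the two columns above, here is a minimal sketch, assuming the default TfidfVectorizer settings (norm="l2", so every row has unit length). Under that assumption, the cosine similarity to "king" is simply the weight stored in the "king" column, and any two different single words have similarity 0 because each one maps to its own one-hot vector. That is why avg_closest_similarity is 0 whenever "king" is not among a sentence's top-weighted words. The names texts, X, king_weight and word_sims are illustrative and not part of the code above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The king sat on the throne with wisdom.",
    "A queen ruled the kingdom alongside the king.",
    "Knights were loyal to their king.",
    "The empire prospered under the rule of a wise monarch.",
]

tfidf = TfidfVectorizer()          # default norm="l2": rows are unit length
X = tfidf.fit_transform(texts)     # sparse matrix, one row per sentence

# Look up the column of "king"; None means the word is out of vocabulary.
king_index = tfidf.vocabulary_.get("king")
if king_index is None:
    king_weight = np.zeros(X.shape[0])
else:
    # The list index keeps the slice two-dimensional before flattening.
    king_weight = X[:, [king_index]].toarray().ravel()

# Same quantity computed the long way, via cosine similarity.
cos_sim = cosine_similarity(X, tfidf.transform(["king"])).ravel()
print(np.allclose(king_weight, cos_sim))

# Pairwise similarity between single in-vocabulary words:
# 1.0 on the diagonal, 0.0 everywhere else.
word_sims = cosine_similarity(tfidf.transform(["king", "queen", "throne"]))
print(word_sims)
```

    To get a non-trivial word-to-word similarity, the words would need dense vectors (for example word embeddings), which TF-IDF alone does not provide.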
    

    That being said, if you absolutely want to focus on Euclidean distance, here is a method:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np
    from scipy.spatial.distance import euclidean
    
    data = {
        'text': [
            "The king sat on the throne with wisdom.",
            "A queen ruled the kingdom alongside the king.",
            "Knights were loyal to their king.",
            "The empire prospered under the rule of a wise monarch."
        ]
    }
    df = pd.DataFrame(data)
    
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(df['text']).toarray()
    
    feature_names = tfidf.get_feature_names_out()
    if "king" in feature_names:
        king_index = np.where(feature_names == "king")[0][0]
        king_vector = np.zeros_like(tfidf_matrix[0])
        king_vector[king_index] = 1
    else:
        print("The word 'king' is not in the vocabulary.")
        king_vector = np.zeros_like(tfidf_matrix[0])
    
    df['king_distance'] = [euclidean(sentence_vector, king_vector) for sentence_vector in tfidf_matrix]
    
    print(df)
    
    

    which gives

                                                    text  king_distance
    0            The king sat on the throne with wisdom.       1.232385
    1      A queen ruled the kingdom alongside the king.       1.216734
    2                  Knights were loyal to their king.       1.204586
    3  The empire prospered under the rule of a wise ...       1.414214
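
    The two result tables are consistent with each other: for unit-length vectors, Euclidean distance d and cosine similarity are related by d² = 2 − 2·cos, so both measures rank the sentences identically (smallest distance means largest similarity). Below is a minimal sketch that checks this, assuming default TfidfVectorizer settings and that "king" is in the vocabulary. It uses sklearn's euclidean_distances rather than the scipy call above, and the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

texts = [
    "The king sat on the throne with wisdom.",
    "A queen ruled the kingdom alongside the king.",
    "Knights were loyal to their king.",
    "The empire prospered under the rule of a wise monarch.",
]

tfidf = TfidfVectorizer()            # default norm="l2": rows are unit length
X = tfidf.fit_transform(texts)
king = tfidf.transform(["king"])     # unit-length one-hot vector for "king"

cos_sim = cosine_similarity(X, king).ravel()
dist = euclidean_distances(X, king).ravel()

# For unit vectors: d^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 cos
print(np.allclose(dist, np.sqrt(2 - 2 * cos_sim)))

# Sorting by ascending distance equals sorting by descending similarity.
by_distance = np.argsort(dist, kind="stable")
by_similarity = np.argsort(-cos_sim, kind="stable")
print(by_distance, by_similarity)
```

    If "king" were missing from the vocabulary, its vector would be all zeros and the identity would no longer hold: every distance would simply equal the row norm, 1.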