
(TF-IDF) How to return the top five related articles after calculating cosine similarity


I have a DataFrame sample_df with 4 columns: paper_id, title, abstract, body_text. I extracted the abstract column (~1000 words per abstract) and applied a text-cleaning step. Here is my question:

After calculating the cosine similarity between the question and each abstract, how can I return the top 5 scoring articles together with their corresponding information (e.g. paper_id, title, body_text)? My goal is TF-IDF question answering.

I'm sorry that my English is poor, and I am new to NLP. I would appreciate it if someone could help.

from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity  

txt_cleaned = get_cleaned_text(sample_df,sample_df['abstract'])
question = ['Can covid19 transmit through air']

tfidf_vector = TfidfVectorizer()

tfidf = tfidf_vector.fit_transform(txt_cleaned)

tfidf_question = tfidf_vector.transform(question)
cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()

related_docs_indices = cosine_similarities.argsort()[:-5:-1]
cosine_similarities[related_docs_indices]

#output([0.18986527, 0.18339485, 0.14951123, 0.13441914]) 

Solution

  • First: if you want 5 articles, use [:-6:-1] instead of [:-5:-1]. With a negative step the stop index is excluded, so [:-5:-1] stops before position -5 and returns only 4 elements.

    Or use [::-1][:5]: [::-1] reverses all the values, and then the ordinary [:5] takes the first five.
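
To make the off-by-one visible, here is a small sketch of the three slices on a made-up score array. The `scores` values are hypothetical and chosen only for illustration; they are not from the question's data.

```python
import numpy as np

# Hypothetical similarity scores, one per document (made up for illustration).
scores = np.array([0.10, 0.50, 0.30, 0.90, 0.20, 0.70, 0.40])

order = scores.argsort()      # positions from lowest to highest score

four = order[:-5:-1]          # stops before position -5, so only 4 positions
five_a = order[:-6:-1]        # stops before position -6, so 5 positions
five_b = order[::-1][:5]      # reverse everything, then take the first 5

print(len(four), len(five_a), len(five_b))
print(five_a, five_b)
```

Both five-element variants should select the same positions; the second form is arguably easier to read because the count appears directly in the slice.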


    Once you have related_docs_indices, you can use .iloc[] to get the matching rows from the DataFrame

     sample_df.iloc[ related_docs_indices ]
    

    If several elements have the same similarity, this returns them in reversed order.


    BTW:

    You can also add the similarities to the DataFrame

    sample_df['similarity'] = cosine_similarities
    

    and then sort in descending order and take the first 5 rows.

    sample_df.sort_values('similarity', ascending=False)[:5]
    

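    If several elements have the same similarity, this returns them in their original order, as in the result further down. The default sort algorithm does not guarantee that, so pass kind='mergesort' to sort_values if the order of ties matters.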
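
As a rough sketch of how ties can be handled explicitly, the snippet below uses a small hypothetical DataFrame (the `similarity` values are invented for illustration) and compares a stable `sort_values` with `nlargest`, which by default (`keep='first'`) prefers earlier rows when scores tie.

```python
import pandas as pd

# Hypothetical scores with a tie between paper 1 and paper 4.
df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'similarity': [0.63, 0.00, 0.00, 0.63, 0.55],
})

# A stable sort (mergesort) keeps tied rows in their original order.
top_sorted = df.sort_values('similarity', ascending=False, kind='mergesort').head(3)

# nlargest avoids sorting the whole frame when only a few rows are needed.
top_nlargest = df.nlargest(3, 'similarity')

print(top_sorted)
print(top_nlargest)
```

For a frame with only a few thousand abstracts the performance difference is unlikely to matter, so the choice is mostly about readability.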


    Minimal working code with some sample data, so anyone can copy and test it.

    Because the DataFrame has only 5 rows, I search for the top 2.

    from sklearn.metrics.pairwise import linear_kernel
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    import pandas as pd
    
    sample_df = pd.DataFrame({
        'paper_id': [1, 2, 3, 4, 5],
        'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
        'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
        'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer', 'Hello covid19 again', 'Buy new air conditioner'],
    })
    
    # Placeholder for the asker's cleaning function: returns the column unchanged
    def get_cleaned_text(df, column):
        return column
    
    txt_cleaned = get_cleaned_text(sample_df, sample_df['abstract'])
    question = ['Can covid19 transmit through air']
    
    tfidf_vector = TfidfVectorizer()
    
    tfidf = tfidf_vector.fit_transform(txt_cleaned)
    
    tfidf_question = tfidf_vector.transform(question)
    cosine_similarities = linear_kernel(tfidf_question,tfidf).flatten()
    
    sample_df['similarity'] = cosine_similarities
    
    number = 2  # how many top articles to return
    # equivalent alternative: cosine_similarities.argsort()[:-(number+1):-1]
    related_docs_indices = cosine_similarities.argsort()[::-1][:number]
    
    print('index:', related_docs_indices)
    
    print('similarity:', cosine_similarities[related_docs_indices])
    
    print('\n--- related_docs_indices ---\n')
    
    print(sample_df.iloc[related_docs_indices])
    
    print('\n--- sort_values ---\n')
    
    print( sample_df.sort_values('similarity', ascending=False)[:number] )
    

    Result:

    index: [3 0]
    similarity: [0.62791376 0.62791376]
    
    --- related_docs_indices ---
    
       paper_id          title abstract            body_text  similarity
    3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
    0         1        Covid19  covid19        Hello covid19    0.627914
    
    --- sort_values ---
    
       paper_id          title abstract            body_text  similarity
    0         1        Covid19  covid19        Hello covid19    0.627914
    3         4  Covid19 Again  covid19  Hello covid19 again    0.627914
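
The steps above could also be wrapped in a single helper. The following is only a sketch: the function name `top_k_articles` and its parameters are hypothetical, it assumes the text column is already cleaned, and it uses a stable NumPy sort so that tied scores keep their original row order.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_k_articles(df, question, k=5, text_column='abstract'):
    """Return the k rows of df whose text_column best matches question.

    Hypothetical helper for illustration; not part of any library.
    """
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(df[text_column])
    question_vec = vectorizer.transform([question])
    # TfidfVectorizer L2-normalises rows by default, so the dot product
    # computed by linear_kernel equals the cosine similarity.
    scores = linear_kernel(question_vec, doc_matrix).flatten()
    # Negate to sort descending; the stable sort keeps ties in original order.
    top = np.argsort(-scores, kind='stable')[:k]
    result = df.iloc[top].copy()   # copy so the caller's frame is not modified
    result['similarity'] = scores[top]
    return result


sample_df = pd.DataFrame({
    'paper_id': [1, 2, 3, 4, 5],
    'title': ['Covid19', 'Flu', 'Cancer', 'Covid19 Again', 'New Air Conditioners'],
    'abstract': ['covid19', 'flu', 'cancer', 'covid19', 'air conditioner'],
    'body_text': ['Hello covid19', 'Hello flu', 'Hello cancer',
                  'Hello covid19 again', 'Buy new air conditioner'],
})

best = top_k_articles(sample_df, 'Can covid19 transmit through air', k=3)
print(best[['paper_id', 'title', 'body_text', 'similarity']])
```

Because the helper returns whole rows, paper_id, title and body_text are available directly on the result without a second lookup.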