python, pandas, group-by, nltk, similarity

How to group text data based on document similarity?


Consider a dataframe like the one below:

import pandas as pd

df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?',
                                 'What are you doing now?', 'What is your name?',
                                 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})
                   Questions
0          What are you doing?
1  What are you doing tonight?
2      What are you doing now?
3           What is your name?
4      What is your nick name?
5      What is your full name?
6               Shall we meet?
7           How are you doing?

How can I group the dataframe by similar Questions? i.e. how can I get groups like the ones below:

for _, i in df.groupby('similarity')['Questions']:
    print(i,'\n')
6    Shall we meet?
Name: Questions, dtype: object 

3         What is your name?
4    What is your nick name?
5    What is your full name?
Name: Questions, dtype: object 

0            What are you doing?
1    What are you doing tonight?
2        What are you doing now?
7             How are you doing?
Name: Questions, dtype: object 

A similar question was asked here, but with less clarity, so it received no answers.


Solution

  • Here's one fairly heavyweight approach: compute the normalized similarity score between each element and every element in the series, then group by the resulting list of scores converted to a string, i.e.

    import numpy as np
    import nltk
    from nltk.corpus import wordnet as wn
    import pandas as pd
    # Needs the NLTK data for tokenizing, tagging and WordNet, e.g.
    # nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')
    
    def convert_tag(tag):   
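        """Map a Penn Treebank POS tag to its WordNet counterpart."""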
        tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
        try:
            return tag_dict[tag[0]]
        except KeyError:
            return None
    
    def doc_to_synsets(doc):
        """
        Returns a list of synsets in document.
    
        Tokenizes and tags the words in the document doc.
        Then finds the first synset for each word/tag combination.
        If a synset is not found for that combination it is skipped.
    
        Args:
            doc: string to be converted
    
        Returns:
            list of synsets
    
        Example:
            doc_to_synsets('Fish are nvqjp friends.')
            Out: [Synset('fish.n.01'), Synset('be.v.01'),
                  Synset('friend.n.01')]
        """
    
        synsetlist = []
        tokens = nltk.word_tokenize(doc)
        pos = nltk.pos_tag(tokens)
        for tup in pos:
            try:
                # Take the first (most common) synset for each word/POS pair.
                synsetlist.append(wn.synsets(tup[0], convert_tag(tup[1]))[0])
            except IndexError:
                # No synset exists for this word/tag combination; skip it.
                continue
        return synsetlist
    
    def similarity_score(s1, s2):
        """
        Calculate the normalized similarity score of s1 onto s2
    
        For each synset in s1, find the synset in s2 with the largest similarity value.
        Sum these largest similarity values and normalize by dividing by the number of matches found.
    
        Args:
            s1, s2: list of synsets from doc_to_synsets
    
        Returns:
            normalized similarity score of s1 onto s2
    
        Example:
            synsets1 = doc_to_synsets('I like cats')
            synsets2 = doc_to_synsets('I like dogs')
            similarity_score(synsets1, synsets2)
            Out: 0.73333333333333339
        """
    
        highscores = []
        for synset1 in s1:
            highest_yet = 0
            for synset2 in s2:
                # path_similarity returns None when no path connects the two synsets.
                simscore = synset1.path_similarity(synset2)
                if simscore is not None and simscore > highest_yet:
                    highest_yet = simscore

            if highest_yet > 0:
                highscores.append(highest_yet)

        return sum(highscores) / len(highscores) if highscores else 0
    
    def document_path_similarity(doc1, doc2):
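        """Return the symmetric similarity of two documents: the average of the
        similarity of doc1 onto doc2 and of doc2 onto doc1."""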
        synsets1 = doc_to_synsets(doc1)
        synsets2 = doc_to_synsets(doc2)
        return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
    
    
    def similarity(x, df):
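        """Return the list of similarity scores of question x against every question in df."""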
        sim_score = []
        for i in df['Questions']:
            sim_score.append(document_path_similarity(x, i))
        return sim_score
    

    With the methods defined above, we can now do:

    df['similarity'] = df['Questions'].apply(lambda x: similarity(x, df)).astype(str)
    
    for _, i in df.groupby('similarity')['Questions']:
        print(i,'\n')
    

    Output:

    6    Shall we meet?
    Name: Questions, dtype: object 
    
    3         What is your name?
    4    What is your nick name?
    5    What is your full name?
    Name: Questions, dtype: object 
    
    0            What are you doing?
    1    What are you doing tonight?
    2        What are you doing now?
    7             How are you doing?
    Name: Questions, dtype: object 
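
    A note on why the groupby works: each row's similarity column holds the full list of that question's scores against every question, converted to a string only so that groupby can hash it, and two rows land in the same group exactly when their score lists match. If floating-point noise ever makes near-identical lists differ in the last digits, rounding before stringifying keeps the groups stable (a small tweak; two decimals is an arbitrary choice):

    df['similarity'] = df['Questions'].apply(lambda x: str(np.round(similarity(x, df), 2)))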
    

    This isn't the best approach to the problem, and is really slow. Any new approach is highly appreciated.
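
    One easy speedup that keeps the same scores (a sketch of the idea rather than a new algorithm): the apply above calls doc_to_synsets on both sides of every pair, so tokenizing, tagging and synset lookup are repeated on the order of n² times. Computing the synsets once per question and reusing them does that work only n times:

    synsets = [doc_to_synsets(q) for q in df['Questions']]

    df['similarity'] = [str([(similarity_score(s1, s2) + similarity_score(s2, s1)) / 2
                             for s2 in synsets])
                        for s1 in synsets]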