python pandas fuzzywuzzy

Wrong value in text comparison


I am having some difficulty finding matching texts in the dataset below. Note that Sim is my current output, generated by the code further down, and it shows the wrong matches.

    ID      Text                                                   Sim
13  fsad    amazing  ...                                           fsd
14  fdsdf   best sport everand the gane of the year❤️❤️❤️❤️...     fdsfdgte3e
18  gsd     wonderful                                              fast 
21  dfsfs   i love this its incredible ...                         reds
23  gwe     wonderful end ever seen you ...                        add
... ... ... ...
261 add     wonderful                                              gwe
261 add     wonderful                                              gsd
261 add     wonderful                                              fdsdf
267 fdsfdgte3e  best match ever its a masterpiece                  fdsdf
277 hgdfgre terrible destroys everything ...                       tm28

As shown above, Sim does not give the ID of the user who wrote the matching text. For example, add should match gsd and vice versa, but my output says that add matches gwe, which is not true.

The code I am using is the following:

    from fuzzywuzzy import fuzz
    
        def sim (nm, df): # this function finds matches between texts based on a threshold, which is 100. The logic is fuzzywuzzy, specifically partial ratio. The output should be IDs whether texts match, based on the threshold.
            matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
            return [df.ID[i] for i, x in enumerate(matches) if x]
    
    df['L_Text']=df['Text'].str.lower() 
    df['Sim']=df.apply(lambda row: sim(row['L_Text'], df), axis=1)
    df=df.assign(
        Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
    )

    import numpy as np

    def tr (row): # this function assigns a similarity score to each text by applying partial_ratio similarity
        return (df.loc[:row.name-1, 'L_Text']
                        .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

    t = (df.loc[1:].apply(tr, axis=1)
             .reindex(index=df.index, 
                      columns=df.index)
             .fillna(0)
             .add_prefix('txt')
         )
    t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))

Could you please help me understand the error in my code? Unfortunately I cannot see it.

My expected output would be as follows:

ID      Text                                                   Sim
13  fsad    amazing  ...                                          
14  fdsdf   best sport everand the gane of the year❤️❤️❤️❤️...    
18  gsd     wonderful                                              add 
21  dfsfs   i love this its incredible ...                         
23  gwe     wonderful end ever seen you ...                       
... ... ... ...
261 add     wonderful                                              gsd
261 add     wonderful                                              gsd
261 add     wonderful                                              gsd
267 fdsfdgte3e  best match ever its a masterpiece                 
277 hgdfgre terrible destroys everything ... 

                 

since the sim function is set to accept only a perfect match (a score of 100).


Solution

    Initial assumption

    First off, since your question was not one hundred percent clear to me, I assume that you want a pairwise comparison of all rows and, whenever the match score equals 100, you want to add the key of the matching row. If this is not the case, please correct me.

    Syntactic problems

    There are multiple problems with your code above. First, if one simply copies and pastes it, it does not run because of syntax errors. The sim() function should read as follows:

    def sim (nm, df): 
        # True for every row whose text scores a perfect 100 against nm
        matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
        return [df.ID[i] for i, x in enumerate(matches) if x]
    

    Notice the df instead of dataset, as well as == instead of =. I also removed the redundant parentheses for better readability.
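
One further thing that may be worth checking, although it is only an assumption about your data: if the DataFrame's index is not the default 0..n-1 range (the row labels 13, 14, 18, ... in the question suggest it might not be), then df.ID[i] looks up i as an index label rather than as a position, so the ID can come from a different row than the one that matched. The sketch below selects the IDs with the boolean mask itself. The score parameter and the contains_score helper are hypothetical names introduced here so that the example runs without fuzzywuzzy; in real use you would pass fuzz.partial_ratio.

```python
import pandas as pd

def sim(nm, df, score):
    """Return the IDs of all rows whose Text scores 100 against nm.

    score is any callable(text_a, text_b) -> int, e.g. fuzz.partial_ratio.
    """
    # Boolean Series that shares df's index, whatever that index looks like
    mask = df['Text'].apply(lambda text: score(text, nm) == 100)
    # Select by mask instead of df.ID[i], which would treat i as an index label
    return df.loc[mask, 'ID'].tolist()

# Hypothetical stand-in scorer: 100 if one text contains the other, else 0
def contains_score(a, b):
    return 100 if (a in b or b in a) else 0

# Toy frame whose index (13, 18, 23) is deliberately not 0..n-1
df = pd.DataFrame(
    {'ID': ['fsad', 'gsd', 'gwe'],
     'Text': ['amazing', 'wonderful', 'wonderful end ever seen you']},
    index=[13, 18, 23],
)

print(sim('wonderful', df, contains_score))  # ['gsd', 'gwe']
```

As in the original code, the row being compared still shows up in its own result, so the later step that filters out the row's own ID is still needed.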

    Semantic problems

    If I then run your code and print t (which does not seem to be the end result), I get the following:

       txt0  txt1   txt2  txt3   txt4   txt5   txt6   txt7  txt8  txt9
    0   1.0  27.0   12.0  45.0   45.0   12.0   12.0   12.0  27.0  64.0
    1  27.0   1.0   33.0  33.0   42.0   33.0   33.0   33.0  52.0  44.0
    2  12.0  33.0    1.0  22.0  100.0  100.0  100.0  100.0  22.0  33.0
    3  45.0  33.0   22.0   1.0   41.0   22.0   22.0   22.0  40.0  30.0
    4  45.0  42.0  100.0  41.0    1.0  100.0  100.0  100.0  35.0  47.0
    5  12.0  33.0  100.0  22.0  100.0    1.0  100.0  100.0  22.0  33.0
    6  12.0  33.0  100.0  22.0  100.0  100.0    1.0  100.0  22.0  33.0
    7  12.0  33.0  100.0  22.0  100.0  100.0  100.0    1.0  22.0  33.0
    8  27.0  52.0   22.0  40.0   35.0   22.0   22.0   22.0   1.0  34.0
    9  64.0  44.0   33.0  30.0   47.0   33.0   33.0   33.0  34.0   1.0
    

    which seems correct to me, as fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (the shorter text is found in full inside the longer one, and such a partial match already counts as a score of 100). For consistency reasons you could change

    t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
    

    to

    t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100
    

    since every text should match itself perfectly. So when you said

    But my output says that add matches with gwe and this is not true.

    the match between add and gwe is in fact correct as far as fuzz.partial_ratio() is concerned. If you only want whole texts to match, you might want to consider using fuzz.ratio() instead. Also, there might be an error when converting t to the new Sim column, but the provided example contains no code for that step.
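
To make the difference between the two kinds of score concrete, here is a rough standard-library illustration. It is not fuzzywuzzy's actual implementation: full_ratio and best_window_ratio are hypothetical helpers built on difflib.SequenceMatcher that only mimic the idea of comparing whole strings versus comparing the shorter string against the best-fitting slice of the longer one.

```python
from difflib import SequenceMatcher

def full_ratio(a, b):
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_window_ratio(a, b):
    """Best score of the shorter string against any equally long slice of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    return max(
        full_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )

short = "wonderful"
long = "wonderful end ever seen you"

print(full_ratio(short, long))         # 50: the extra words lower the whole-string score
print(best_window_ratio(short, long))  # 100: "wonderful" appears verbatim in the longer text
```

With a whole-string score, only texts that are identical reach 100, which is the behaviour the expected output in the question seems to call for.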

    Alternative implementation

    Also, as some comments suggested, it sometimes helps to restructure your code so that it is easier for people to help you. Here is an example of what this could look like:

    import re
    
    import pandas as pd
    from fuzzywuzzy import fuzz
    
    data = """
    13   fsad        amazing ...                                           fsd
    14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...    fdsfdgte3e
    18   gsd         wonderful                                             fast 
    21   dfsfs       i love this its incredible ...                        reds
    23   gwe         wonderful end ever seen you ...                       add
    261  add         wonderful                                             gwe
    261  add         wonderful                                             gsd
    261  add         wonderful                                             fdsdf
    267  fdsfdgte3e  best match ever its a masterpiece                     fdsdf
    277  hgdfgre     terrible destroys everything ...                      tm28
    """
    
    rows = data.strip().split('\n')
    records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]
    
    df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
    df = df.drop('IncorrectSim', axis=1)
    df = df.drop_duplicates(subset=["ID", "Text"])  # Assuming that there is no point in keeping duplicate rows
    df = df.set_index('ID')  # Assuming that the "ID" column holds a unique ID
    
    comparison_df = df.copy()
    comparison_df['Text'] = comparison_df["Text"].str.lower()
    comparison_df['Tmp'] = 1
    # This gives us all possible row combinations
    comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
    comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']]  # We only want rows that do not match itself
    comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
    comparison_df = comparison_df[comparison_df['matchScore'] == 100]  # only keep perfect matches
    comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID')  # Cleanup
    
    result = df.join(comparison_df, how='left').fillna('')
    print(result.to_string())
    

    gives:

                                                             Text  Sim
    ID                                                                
    add                                                 wonderful  gsd
    add                                                 wonderful  gwe
    dfsfs                          i love this its incredible ...     
    fdsdf       best sport everand the gane of the year❤️❤️❤️❤...     
    fdsfdgte3e                  best match ever its a masterpiece     
    fsad                                              amazing ...     
    gsd                                                 wonderful  gwe
    gsd                                                 wonderful  add
    gwe                           wonderful end ever seen you ...  gsd
    gwe                           wonderful end ever seen you ...  add
    hgdfgre                      terrible destroys everything ...
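
If whole-text matches are what you are after, the same pairwise approach can be sketched with a whole-string score and one list of matches per ID, which is closer to the expected output in the question. This is only a sketch under a few assumptions: full_ratio is a hypothetical difflib-based stand-in for fuzz.ratio (so the example runs without fuzzywuzzy), the toy data is a shortened version of the question's table, and how='cross' requires pandas 1.2 or newer (on older versions the Tmp column trick above does the same job).

```python
from difflib import SequenceMatcher

import pandas as pd

def full_ratio(a, b):
    # Stand-in for fuzz.ratio: whole-string similarity on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())

df = pd.DataFrame({
    'ID': ['fsad', 'gsd', 'gwe', 'add'],
    'Text': ['amazing', 'wonderful', 'wonderful end ever seen you', 'wonderful'],
})

# Every ordered pair of rows, without the rows paired with themselves
pairs = df.merge(df, how='cross', suffixes=('_x', '_y'))
pairs = pairs[pairs['ID_x'] != pairs['ID_y']].copy()
pairs['score'] = [full_ratio(a, b) for a, b in zip(pairs['Text_x'], pairs['Text_y'])]

# Collect the perfectly matching IDs into one list per ID
matches = pairs[pairs['score'] == 100].groupby('ID_x')['ID_y'].agg(list)

# IDs without any match get an empty list instead of NaN
sim_lists = df['ID'].map(matches).apply(lambda v: v if isinstance(v, list) else [])
result = df.assign(Sim=sim_lists)
print(result.to_string())
```

Here gsd and add end up matching each other only, while gwe stays empty because its text merely contains "wonderful" rather than being equal to it.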