Wrong value in text comparison

I am having trouble finding matching texts in the dataset below. (Sim is my current output, generated by the code further down; it shows the wrong matches.)

    ID      Text                                                   Sim
13  fsad    amazing  ...                                           fsd
14  fdsdf   best sport everand the gane of the year❤️❤️❤️❤️...     fdsfdgte3e
18  gsd     wonderful                                              fast 
21  dfsfs   i love this its incredible ...                         reds
23  gwe     wonderful end ever seen you ...                        add
... ... ... ...
261 add     wonderful                                              gwe
261 add     wonderful                                              gsd
261 add     wonderful                                              fdsdf
267 fdsfdgte3e  best match ever its a masterpiece                  fdsdf
277 hgdfgre terrible destroys everything ...                       tm28

As shown above, Sim does not give the ID of the user who wrote the matching text. For example, add should match gsd and vice versa, but my output says that add matches gwe, which is not true.

The code I am using is the following:

    from fuzzywuzzy import fuzz
    
        def sim (nm, df): # this function finds matches between texts based on a threshold, which is 100. The logic is fuzzywuzzy, specifically partial ratio. The output should be IDs whether texts match, based on the threshold.
            matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
            return [df.ID[i] for i, x in enumerate(matches) if x]
    
    df['L_Text']=df['Text'].str.lower() 
    df['Sim']=df.apply(lambda row: sim(row['L_Text'], df), axis=1)
    df=df.assign(
        Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
    )

def tr (row): # this function assign a similarity score for each text applying partial_ratio similarity
    return (df.loc[:row.name-1, 'L_Text']
                    .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

t = (df.loc[1:].apply(tr, axis=1)
         .reindex(index=df.index, 
                  columns=df.index)
         .fillna(0)
         .add_prefix('txt')
     )
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))

Could you please help me understand the error in my code? Unfortunately I cannot see it.

My expected output would be as follows:

ID      Text                                                   Sim
13  fsad    amazing  ...                                          
14  fdsdf   best sport everand the gane of the year❤️❤️❤️❤️...    
18  gsd     wonderful                                              add 
21  dfsfs   i love this its incredible ...                         
23  gwe     wonderful end ever seen you ...                       
... ... ... ...
261 add     wonderful                                              gsd
261 add     wonderful                                              gsd
261 add     wonderful                                              gsd
267 fdsfdgte3e  best match ever its a masterpiece                 
277 hgdfgre terrible destroys everything ...

because the sim function is set to require a perfect match (a score of 100).

Solution

Initial assumption

First off, since your question was not entirely clear to me, I assume that you want a pairwise comparison of all rows and that, whenever the score of a pair is exactly 100, you want to add the ID of the matching row. If this is not the case, please correct me.

Syntactic problems

There are several problems with your code above. First, it cannot be run as pasted, because it is not syntactically valid. The sim() function should read as follows:

def sim (nm, df): 
    matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]

Note the df instead of dataset, as well as the == instead of the =. I also removed the redundant parentheses for better readability.
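One further thing that may be worth checking (an assumption on my part, since the index of your real DataFrame is not shown): `df.ID[i]` combines a position from `enumerate` with a lookup by index label, and for an integer index those two only agree when the index is the default 0..n-1. Below is a minimal sketch of a label-safe variant. The name `exact_matches` and the small frame are hypothetical, and plain equality stands in for the `fuzz.partial_ratio(...) == 100` test so that the snippet needs only pandas.

```python
import pandas as pd

# Hypothetical frame whose index has gaps (e.g. after filtering rows).
df = pd.DataFrame(
    {
        "ID": ["gsd", "gwe", "add"],
        "Text": ["wonderful", "wonderful end ever seen you", "wonderful"],
    },
    index=[18, 23, 261],
)


def exact_matches(text, frame):
    """Return the IDs of all rows whose Text equals `text`.

    Equality is only a stand-in for the fuzzy score test.
    """
    mask = frame["Text"] == text            # boolean Series aligned on index labels
    return frame.loc[mask, "ID"].tolist()   # select by mask, no positional lookup


print(exact_matches("wonderful", df))  # ['gsd', 'add']
```

Selecting with a boolean mask avoids mixing positions and labels, so it behaves the same regardless of what the index looks like.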

Semantic problems

If I then run your code and print t (which does not seem to be the end result), I get the following:

   txt0  txt1   txt2  txt3   txt4   txt5   txt6   txt7  txt8  txt9
0   1.0  27.0   12.0  45.0   45.0   12.0   12.0   12.0  27.0  64.0
1  27.0   1.0   33.0  33.0   42.0   33.0   33.0   33.0  52.0  44.0
2  12.0  33.0    1.0  22.0  100.0  100.0  100.0  100.0  22.0  33.0
3  45.0  33.0   22.0   1.0   41.0   22.0   22.0   22.0  40.0  30.0
4  45.0  42.0  100.0  41.0    1.0  100.0  100.0  100.0  35.0  47.0
5  12.0  33.0  100.0  22.0  100.0    1.0  100.0  100.0  22.0  33.0
6  12.0  33.0  100.0  22.0  100.0  100.0    1.0  100.0  22.0  33.0
7  12.0  33.0  100.0  22.0  100.0  100.0  100.0    1.0  22.0  33.0
8  27.0  52.0   22.0  40.0   35.0   22.0   22.0   22.0   1.0  34.0
9  64.0  44.0   33.0  30.0   47.0   33.0   33.0   33.0  34.0   1.0

which seems correct to me, as fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (the shorter text occurring in full inside the longer one already counts as a score of 100). For consistency, you could change

t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))

to

t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100

as every text should match itself perfectly, so the diagonal should hold 100 rather than 1. So when you said

But my output says that add matches with gwe and this is not true.

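the match is actually correct as far as fuzz.partial_ratio() is concerned: "wonderful" is fully contained in "wonderful end ever seen you", so the pair scores 100. If you only want texts that match as a whole, you might want to consider using fuzz.ratio() instead. Also, there might be an error when converting t to the new Sim column, but the provided example contains no code for that step.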
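To make the difference between the two scores concrete, here is a simplified sketch built on the standard library's difflib. It is not fuzzywuzzy's actual implementation; `full_ratio` and `naive_partial_ratio` are hypothetical names that only illustrate the idea: a whole-string score versus the best score of the shorter string against any equally long window of the longer one.

```python
from difflib import SequenceMatcher


def full_ratio(a: str, b: str) -> int:
    """Similarity of the two complete strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def naive_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any window of the longer one."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    # Slide a window of the shorter string's length over the longer string.
    windows = (longer[i:i + n] for i in range(len(longer) - n + 1))
    return max(full_ratio(shorter, window) for window in windows)


short_text = "wonderful"
long_text = "wonderful end ever seen you"

print(naive_partial_ratio(short_text, long_text))  # 100: the short text is a substring
print(full_ratio(short_text, long_text))           # 50: only a third of the long text matches
print(full_ratio(short_text, short_text))          # 100: identical texts
```

With a threshold of exactly 100, the whole-string score keeps only identical texts (gsd and add), while the partial score also pairs both of them with gwe.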

Alternative implementation

Also, as some comments suggested, it sometimes helps to restructure your code so that it is easier for people to help you. Here is an example of what this could look like:

import re

import pandas as pd
from fuzzywuzzy import fuzz

data = """
13   fsad        amazing ...                                           fsd
14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...    fdsfdgte3e
18   gsd         wonderful                                             fast 
21   dfsfs       i love this its incredible ...                        reds
23   gwe         wonderful end ever seen you ...                       add
261  add         wonderful                                             gwe
261  add         wonderful                                             gsd
261  add         wonderful                                             fdsdf
267  fdsfdgte3e  best match ever its a masterpiece                     fdsdf
277  hgdfgre     terrible destroys everything ...                      tm28
"""

rows = data.strip().split('\n')
records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]

df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
df = df.drop('IncorrectSim', axis=1)
df = df.drop_duplicates(subset=["ID", "Text"])  # Assuming that there is no point in keeping duplicate rows
df = df.set_index('ID')  # Assuming that the "ID" column holds a unique ID

comparison_df = df.copy()
comparison_df['Text'] = comparison_df["Text"].str.lower()
comparison_df['Tmp'] = 1
# This gives us all possible row combinations
comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']]  # Drop the pairs in which a row is compared with itself
comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
comparison_df = comparison_df[comparison_df['matchScore'] == 100]  # only keep perfect matches
comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID')  # Cleanup

result = df.join(comparison_df, how='left').fillna('')
print(result.to_string())

gives:

                                                         Text  Sim
ID                                                                
add                                                 wonderful  gsd
add                                                 wonderful  gwe
dfsfs                          i love this its incredible ...     
fdsdf       best sport everand the gane of the year❤️❤️❤️❤...     
fdsfdgte3e                  best match ever its a masterpiece     
fsad                                              amazing ...     
gsd                                                 wonderful  gwe
gsd                                                 wonderful  add
gwe                           wonderful end ever seen you ...  gsd
gwe                           wonderful end ever seen you ...  add
hgdfgre                      terrible destroys everything ...
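As a side note, the Tmp column used above is a common trick for building all row combinations. On pandas 1.2 or newer, `merge(..., how="cross")` should give the same pairs without the helper column. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical three-row frame, just to show the shape of the result.
df = pd.DataFrame(
    {
        "ID": ["gsd", "gwe", "add"],
        "Text": ["wonderful", "wonderful end ever seen you", "wonderful"],
    }
)

# how="cross" pairs every row with every row; the overlapping column
# names receive the default _x / _y suffixes.
pairs = df.merge(df, how="cross")
pairs = pairs[pairs["ID_x"] != pairs["ID_y"]]  # drop self-pairs

print(len(pairs))             # 6
print(sorted(pairs.columns))  # ['ID_x', 'ID_y', 'Text_x', 'Text_y']
```

From here the scoring and filtering steps would be the same as in the code above.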