I am having some difficulty finding matching texts in the dataset below (note that Sim
is my current output, generated by running the code below, and it shows the wrong matches).
ID Text Sim
13 fsad amazing ... fsd
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️... fdsfdgte3e
18 gsd wonderful fast
21 dfsfs i love this its incredible ... reds
23 gwe wonderful end ever seen you ... add
... ... ... ...
261 add wonderful gwe
261 add wonderful gsd
261 add wonderful fdsdf
267 fdsfdgte3e best match ever its a masterpiece fdsdf
277 hgdfgre terrible destroys everything ... tm28
As shown above, Sim
does not give the ID
of the user who wrote the matching text.
For example, add
should match with gsd
and vice versa. But my output says that add
matches with gwe
and this is not true.
The code I am using is the following:
from fuzzywuzzy import fuzz

def sim (nm, df): # this function finds matches between texts based on a threshold, which is 100. The logic is fuzzywuzzy, specifically partial ratio. The output should be IDs whether texts match, based on the threshold.
    matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]

df['L_Text']=df['Text'].str.lower()
df['Sim']=df.apply(lambda row: sim(row['L_Text'], df), axis=1)
df=df.assign(
    Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
)

def tr (row): # this function assign a similarity score for each text applying partial_ratio similarity
    return (df.loc[:row.name-1, 'L_Text']
              .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

t = (df.loc[1:].apply(tr, axis=1)
       .reindex(index=df.index,
                columns=df.index)
       .fillna(0)
       .add_prefix('txt')
    )
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
Could you please help me understand the error in my code? Unfortunately I cannot see it.
My expected output would be as follows:
ID Text Sim
13 fsad amazing ...
14 fdsdf best sport everand the gane of the year❤️❤️❤️❤️...
18 gsd wonderful add
21 dfsfs i love this its incredible ...
23 gwe wonderful end ever seen you ...
... ... ... ...
261 add wonderful gsd
261 add wonderful gsd
261 add wonderful gsd
267 fdsfdgte3e best match ever its a masterpiece
277 hgdfgre terrible destroys everything ...
since a perfect match (a score of 100) is required in the sim
function.
First off, as your question was not entirely clear to me, I assume that you would like a pairwise comparison of all rows, and that whenever the match score equals 100 you would like to add the key of the matching row. If this is not the case, please correct me.
There are multiple problems with your code above. First, it cannot be run as posted: the single = inside the lambda is a syntax error, and dataset is never defined. The sim()
function should read as follows:
def sim(nm, df):
    # True for every row whose text scores a perfect partial match against nm
    matches = df.apply(lambda row: fuzz.partial_ratio(row['Text'], nm) == 100, axis=1)
    return [df.ID[i] for i, x in enumerate(matches) if x]
Note the df
instead of dataset
as well as the ==
instead of the =
. I also removed the redundant parentheses for better readability.
If I then run your code and print t
(which does not seem to be the end result), this gives me the following:
txt0 txt1 txt2 txt3 txt4 txt5 txt6 txt7 txt8 txt9
0 1.0 27.0 12.0 45.0 45.0 12.0 12.0 12.0 27.0 64.0
1 27.0 1.0 33.0 33.0 42.0 33.0 33.0 33.0 52.0 44.0
2 12.0 33.0 1.0 22.0 100.0 100.0 100.0 100.0 22.0 33.0
3 45.0 33.0 22.0 1.0 41.0 22.0 22.0 22.0 40.0 30.0
4 45.0 42.0 100.0 41.0 1.0 100.0 100.0 100.0 35.0 47.0
5 12.0 33.0 100.0 22.0 100.0 1.0 100.0 100.0 22.0 33.0
6 12.0 33.0 100.0 22.0 100.0 100.0 1.0 100.0 22.0 33.0
7 12.0 33.0 100.0 22.0 100.0 100.0 100.0 1.0 22.0 33.0
8 27.0 52.0 22.0 40.0 35.0 22.0 22.0 22.0 1.0 34.0
9 64.0 44.0 33.0 30.0 47.0 33.0 33.0 33.0 34.0 1.0
which seems correct to me, as fuzz.partial_ratio("wonderful end ever seen you", "wonderful")
returns 100
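(partial_ratio scores the shorter string against the best-matching substring of the longer one, so a text that is fully contained in another already scores 100).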
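To make the difference between a partial and a whole-string comparison concrete, here is a minimal sketch using only the standard library. The helpers simple_ratio and simple_partial_ratio are hypothetical names, and the sliding-window logic is a simplified stand-in for fuzz.partial_ratio, not fuzzywuzzy's actual implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a: str, b: str) -> int:
    """Best score of the shorter string against any equally long
    window of the longer string (simplified illustration)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


long_text = "wonderful end ever seen you"
short_text = "wonderful"

partial_score = simple_partial_ratio(long_text, short_text)
whole_score = simple_ratio(long_text, short_text)
print(partial_score, whole_score)
```

The partial score is perfect because "wonderful" appears verbatim inside the longer text, while the whole-string score is penalised for the extra words.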
For consistency reasons you could change
t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
to
t += t.to_numpy().T + np.diag(np.ones(t.shape[0])) * 100
as all elements should perfectly match themselves. So when you said
But my output says that add matches with gwe and this is not true.
this is in fact a match as far as fuzz.partial_ratio()
is concerned. If you only want whole texts to match, you might want to consider using fuzz.ratio()
instead. Also, there might be an error when converting t
to the new Sim
column, but the provided example contains no code for that step.
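Since that conversion step is not shown, the following is only a sketch of one way it could be done, written in plain Python rather than pandas. The function name matches_from_scores is hypothetical; the scores are the values printed above for rows 0, 2, 4 and 5 (fsad, gsd, gwe, add), with the diagonal set to 100 as an assumption.

```python
def matches_from_scores(ids, scores, threshold=100):
    """Map each ID to the IDs of all other rows scoring at least `threshold`."""
    result = {}
    for i, row_id in enumerate(ids):
        result[row_id] = [
            other_id
            for j, other_id in enumerate(ids)
            if i != j and scores[i][j] >= threshold  # skip the row itself
        ]
    return result


ids = ["fsad", "gsd", "gwe", "add"]
scores = [
    [100, 12, 45, 12],
    [12, 100, 100, 100],
    [45, 100, 100, 100],
    [12, 100, 100, 100],
]

matches = matches_from_scores(ids, scores)
print(matches)
```

Skipping the diagonal explicitly means a row is never listed as its own match, regardless of what value the diagonal holds.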
Also, as some comments suggested, it often helps to restructure your code so that it is easier for people to help you. Here is an example of what this could look like:
import re
import pandas as pd
from fuzzywuzzy import fuzz
data = """
13   fsad        amazing ...                                              fsd
14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...  fdsfdgte3e
18   gsd         wonderful                                                fast
21   dfsfs       i love this its incredible ...                           reds
23   gwe         wonderful end ever seen you ...                          add
261  add         wonderful                                                gwe
261  add         wonderful                                                gsd
261  add         wonderful                                                fdsdf
267  fdsfdgte3e  best match ever its a masterpiece                        fdsdf
277  hgdfgre     terrible destroys everything ...                         tm28
"""
rows = data.strip().split('\n')
records = [[element for element in re.split(r' {2,}', row) if element != ''] for row in rows]
df = pd.DataFrame.from_records(records, columns=['RowNumber', 'ID', 'Text', 'IncorrectSim'], index='RowNumber')
df = df.drop('IncorrectSim', axis=1)
df = df.drop_duplicates(subset=["ID", "Text"]) # Assuming that there is no point in keeping duplicate rows
df = df.set_index('ID') # Assuming that the "ID" column holds a unique ID
comparison_df = df.copy()
comparison_df['Text'] = comparison_df["Text"].str.lower()
comparison_df['Tmp'] = 1
# This gives us all possible row combinations
comparison_df = comparison_df.reset_index().merge(comparison_df.reset_index(), on='Tmp').drop('Tmp', axis=1)
comparison_df = comparison_df[comparison_df['ID_x'] != comparison_df['ID_y']] # Drop the pairs that compare a row with itself
comparison_df['matchScore'] = comparison_df.apply(lambda row: fuzz.partial_ratio(row['Text_x'], row['Text_y']), axis=1)
comparison_df = comparison_df[comparison_df['matchScore'] == 100] # only keep perfect matches
comparison_df = comparison_df[['ID_x', 'ID_y']].rename(columns={'ID_x': 'ID', 'ID_y': 'Sim'}).set_index('ID') # Cleanup
result = df.join(comparison_df, how='left').fillna('')
print(result.to_string())
gives:
Text Sim
ID
add wonderful gsd
add wonderful gwe
dfsfs i love this its incredible ...
fdsdf best sport everand the gane of the year❤️❤️❤️❤...
fdsfdgte3e best match ever its a masterpiece
fsad amazing ...
gsd wonderful gwe
gsd wonderful add
gwe wonderful end ever seen you ... gsd
gwe wonderful end ever seen you ... add
hgdfgre terrible destroys everything ...
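Note that this output still pairs gwe with gsd and add, because partial_ratio treats "wonderful" as a perfect partial match. If the intent is that only identical texts should match, which is an assumption based on the expected output in the question, a sketch without any fuzzy matching could group the IDs by lowercased text. The function name exact_text_matches is hypothetical and the rows are a subset of the sample data.

```python
from collections import defaultdict

# (ID, text) pairs taken from the sample rows above
rows = [
    ("gsd", "wonderful"),
    ("gwe", "wonderful end ever seen you ..."),
    ("add", "wonderful"),
    ("fsad", "amazing ..."),
]


def exact_text_matches(rows):
    """Map each ID to the other IDs whose normalised text is identical."""
    ids_by_text = defaultdict(list)
    for row_id, text in rows:
        ids_by_text[text.strip().lower()].append(row_id)
    return {
        row_id: [
            other for other in ids_by_text[text.strip().lower()]
            if other != row_id
        ]
        for row_id, text in rows
    }


sim = exact_text_matches(rows)
print(sim)
```

With this stricter rule, gsd and add match each other, and gwe has no match.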