I am looking for input on a data manipulation problem related to natural language processing.
To make life easier, I am using a mock dataset posted several years ago in the question "How to group text data based on document similarity?".
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'Questions': ['What are you doing?','What are you doing tonight?','What are you doing now?','What is your name?','What is your nick name?','What is your full name?','Shall we meet?',
                                 'How are you doing?' ]})

def similarity_score(s1, s2):
    return SequenceMatcher(None, s1, s2).ratio()

def similarity(x, df):
    sim_score = []
    for i in df['Questions']:
        sim_score.append(similarity_score(x, i))
    return sim_score

df['similarity'] = df['Questions'].apply(lambda x : similarity(x, df)).astype(str)
print(df)
The output is as follows:
Questions \
0 What are you doing?
1 What are you doing tonight?
2 What are you doing now?
3 What is your name?
4 What is your nick name?
5 What is your full name?
6 Shall we meet?
7 How are you doing?
similarity
0 [1.0, 0.8260869565217391, 0.9047619047619048, ...
1 [0.8260869565217391, 1.0, 0.84, 0.533333333333...
2 [0.9047619047619048, 0.84, 1.0, 0.585365853658...
3 [0.6486486486486487, 0.5333333333333333, 0.585...
4 [0.5714285714285714, 0.52, 0.5217391304347826,...
5 [0.5714285714285714, 0.52, 0.5652173913043478,...
6 [0.36363636363636365, 0.34146341463414637, 0.3...
7 [0.8108108108108109, 0.6666666666666666, 0.731...
The logic is that I go through each row in the data frame and compare it to all other rows (including itself) to compute their similarity. I then store the similarity scores as a list in another column called "similarity".
Next, I want to categorize the questions in the first column. If the similarity score between two rows is greater than 0.9, those rows should be assigned to the same group. How can I achieve this?
One solution is to iterate row-wise over your similarity scores, create a binary mask based on some threshold, and then use that mask to extract only the questions that meet the threshold.
Note that this solution presumes that the "groups" you desire are the questions themselves (i.e. for each question, you want a list of similar questions associated with it). I made up similarity scores for the rest of the array to create this minimal example.
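One caveat: the code in the question ends with `.astype(str)`, so each entry of its "similarity" column is a string rather than a list of floats, and a threshold comparison on it would not work directly. A minimal sketch of parsing such a string back into a list, assuming it holds the complete list (the sample string `stored` below is a shortened, illustrative row, not the full data):

```python
import ast

# Illustrative stand-in for one cell of a "similarity" column stored as text
stored = "[1.0, 0.8260869565217391, 0.9047619047619048]"

# literal_eval safely parses the Python literal back into a list of floats
scores = ast.literal_eval(stored)

# The threshold mask can then be built on real numbers
mask = [score > 0.9 for score in scores]

print(scores)
print(mask)
```

Dropping `.astype(str)` when the column is created avoids the round trip entirely.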
import pandas as pd

orig_data = {
    "Questions": [
        "What are you doing?",
        "What are you doing tonight?",
        "What are you doing now?",
        "What is your name?",
        "What is your nick name?",
        "What is your full name?",
        "Shall we meet?",
        "How are you doing?",
    ],
    "similarity": [
        [1.0, 0.826, 0.905, 0.234, 0.544, 0.673, 0.411, 0.45],
        [0.826, 1.0, 0.84, 0.533, 0.444, 0.525, 0.641, 0.62],
        [0.905, 0.84, 1.0, 0.585, 0.861, 0.685, 0.455, 0.65],
        [0.649, 0.533, 0.585, 1.0, 0.901, 0.902, 0.642, 0.234],
        [0.571, 0.52, 0.522, 0.901, 1.0, 0.905, 0.753, 0.786],
        [0.571, 0.52, 0.565, 0.902, 0.903, 1.0, 0.123, 0.586],
        [0.364, 0.341, 0.3, 0.674, 0.584, 0.421, 1.0, 0.544],
        [0.811, 0.667, 0.731, 0.345, 0.764, 0.242, 0.55, 1.0],
    ],
}
df = pd.DataFrame(orig_data)
results = []
for idx, sim_row in enumerate(df["similarity"]):
    # True wherever the score exceeds the 0.9 threshold
    bin_mask = [score > 0.9 for score in sim_row]
    curr_q = df["Questions"].iloc[idx]
    # Keep the questions above the threshold, excluding the current one
    sim_quests = [q for q, b in zip(df["Questions"], bin_mask) if b and q != curr_q]
    results.append(sim_quests)
df["similar-questions"] = results
print(df)
Questions ... similar-questions
0 What are you doing? ... [What are you doing now?]
1 What are you doing tonight? ... []
2 What are you doing now? ... [What are you doing?]
3 What is your name? ... [What is your nick name?, What is your full na...
4 What is your nick name? ... [What is your name?, What is your full name?]
5 What is your full name? ... [What is your name?, What is your nick name?]
6 Shall we meet? ... []
7 How are you doing? ... []
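If a single group label per question is wanted instead of a list of similar questions, one possible approach is to treat every pair scoring above the threshold as linked and label the resulting connected components. The sketch below is an alternative under that assumption, not the method above: it recomputes the real `SequenceMatcher` scores from the question, and the helpers `find` and `union` and the names `THRESHOLD`, `parent`, and `labels` are illustrative.

```python
from difflib import SequenceMatcher

questions = [
    "What are you doing?",
    "What are you doing tonight?",
    "What are you doing now?",
    "What is your name?",
    "What is your nick name?",
    "What is your full name?",
    "Shall we meet?",
    "How are you doing?",
]
THRESHOLD = 0.9  # same cut-off as in the question


def similarity_score(s1, s2):
    return SequenceMatcher(None, s1, s2).ratio()


# Union-find: parent[i] points towards the representative of i's group
parent = list(range(len(questions)))


def find(i):
    # Follow parent links to the root, shortening the path along the way
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(i, j):
    # Merge the groups of i and j, keeping the smaller index as the root
    root_i, root_j = find(i), find(j)
    if root_i != root_j:
        parent[max(root_i, root_j)] = min(root_i, root_j)


# Link every pair of questions whose score exceeds the threshold
for i in range(len(questions)):
    for j in range(i + 1, len(questions)):
        if similarity_score(questions[i], questions[j]) > THRESHOLD:
            union(i, j)

# Convert roots into consecutive group ids in order of first appearance
group_ids = {}
labels = []
for i in range(len(questions)):
    root = find(i)
    labels.append(group_ids.setdefault(root, len(group_ids)))

for question, label in zip(questions, labels):
    print(label, question)
```

With the DataFrame from the question, `df["group"] = labels` would attach the labels as a column. Note that this grouping is transitive: if A links to B and B links to C, all three share a label even when A and C score below the threshold. Going by the scores printed in the question, the pair "What are you doing?" and "What are you doing now?" (about 0.905) does exceed 0.9, so those two rows end up in the same group.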