I'm hoping to use fuzzywuzzy to compare all strings in a list against each other, but it looks like not every string is being compared against every other string in the list. Here's what I've tried:
from fuzzywuzzy import fuzz, process
matrix = [(x,) + i for item in output for x in item for i in process.extract(x, item, scorer=fuzz.partial_ratio)]
which is equivalent to:
matrix = []
for item in output:
    for x in item:
        for i in process.extract(x, item, scorer=fuzz.partial_ratio):
            matrix.append((x,) + i)
Here is one item for which each string is being checked against all other strings for similarity:
[['Java',
'JavaVersio',
'Control',
'GitTools',
'Sketch',
'IVision',
'Zepli',
'Go',
'GoAutomatedTesting',
'AutomatedTestingProjectManagement',
'AgileMethodology',
'ScrumEnglish',
'Writte',
'English',
'Spoke',
'EnglishMobile',
'ReactNative',
'Ionic',
'Android',
'Kotlin',
'ObjectiveC'],
['HTML',
'HTMLJava',
'JavaJavaScript',
'JavaScript',
'React',
'NodejsVersio',
'Control',
'GitManualQA',...
With k = 21 strings in the first item, there should be k * (k - 1) / 2 = 210 comparisons, but the rows for the next item already start at index 105:
matrix_df = pd.DataFrame(matrix, columns=["word", "match", "score"])
matrix_df[100:150]
     word        match                              score
100  ObjectiveC  ObjectiveC                           100
101  ObjectiveC  ReactNative                           57
102  ObjectiveC  AutomatedTestingProjectManagement     45
103  ObjectiveC  Ionic                                 40
104  ObjectiveC  Sketch                                38
105  HTML        HTML                                 100
106  HTML        HTMLJava                              90
107  HTML        Control                               45
108  HTML        GitManualQA                           45
109  HTML        PostgreSQLManagementHosting           45
110  HTMLJava    HTMLJava                             100
111  HTMLJava    HTML                                  90
112  HTMLJava    JavaJavaScript                        45
Why is this happening, and how can I fix it?
Thank you!
The function process.extract in fuzzywuzzy has the following signature:
def extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5):
Here limit defaults to 5, which means the function returns only the 5 best matches within choices (fewer when choices has fewer than 5 elements). That is why each of the 21 strings in your first item produces only 5 rows, and the next item starts at index 21 * 5 = 105. To get the scores for all elements, pass limit=None:
matrix = [
    (x,) + i
    for item in output
    for x in item
    for i in process.extract(x, item, scorer=fuzz.partial_ratio, limit=None)
]
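Note that with limit=None every string is scored against all k strings in its list, itself included, so a 21-string item yields 21 * 21 = 441 rows rather than the 210 unique pairs mentioned in the question. If only the k * (k - 1) / 2 unordered pairs are wanted, one option is itertools.combinations. The following is a minimal sketch, not part of fuzzywuzzy: unique_pairs and stand_in_score are hypothetical names, and the default scorer uses the standard library's difflib as a stand-in so the snippet runs without fuzzywuzzy installed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_score(a, b):
    """Stdlib stand-in scorer (NOT fuzzywuzzy): similarity as an int from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def unique_pairs(words, scorer=stand_in_score):
    """Score every unordered pair of list positions exactly once."""
    return [(a, b, scorer(a, b)) for a, b in combinations(words, 2)]


# First item from the question: k = 21 strings.
item = [
    "Java", "JavaVersio", "Control", "GitTools", "Sketch", "IVision",
    "Zepli", "Go", "GoAutomatedTesting", "AutomatedTestingProjectManagement",
    "AgileMethodology", "ScrumEnglish", "Writte", "English", "Spoke",
    "EnglishMobile", "ReactNative", "Ionic", "Android", "Kotlin", "ObjectiveC",
]

k = len(item)
pairs = unique_pairs(item)

print(k * 5)       # rows per item with the default limit=5 -> 105
print(k * k)       # rows per item with limit=None -> 441
print(len(pairs))  # unique pairs, k * (k - 1) // 2 -> 210
```

Passing scorer=fuzz.partial_ratio to unique_pairs should give fuzzywuzzy scores for the same 210 pairs; the stand-in's scores will differ from those.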