Fuzzy Wuzzy Not Comparing Every String Against Every Other String in String_List


I'm hoping to use fuzzywuzzy to compare all strings in a list against each other, but it looks like not every string is being compared against every other string in the list. Here's what I've tried:

matrix = [(x,) + i for item in output for x in item for i in process.extract(x, item, scorer=fuzz.partial_ratio)]

This is equivalent to:

matrix = []
for item in output:
    for x in item:
        for i in process.extract(x, item, scorer=fuzz.partial_ratio):
            matrix.append((x,) + i)  # (word, match, score)

Here is the beginning of output. Each inner list is one item, and every string in an item should be checked against all other strings in that item for similarity:

[['Java',
  'JavaVersio',
  'Control',
  'GitTools',
  'Sketch',
  'IVision',
  'Zepli',
  'Go',
  'GoAutomatedTesting',
  'AutomatedTestingProjectManagement',
  'AgileMethodology',
  'ScrumEnglish',
  'Writte',
  'English',
  'Spoke',
  'EnglishMobile',
  'ReactNative',
  'Ionic',
  'Android',
  'Kotlin',
  'ObjectiveC'],
['HTML',
  'HTMLJava',
  'JavaJavaScript',
  'JavaScript',
  'React',
  'NodejsVersio',
  'Control',
  'GitManualQA',...

The first item has k = 21 strings, so there should be 210 comparisons (k * (k-1) / 2) for it alone, but the output below shows the next item already starting at index 105:

matrix_df = pd.DataFrame(matrix, columns=["word", "match", "score"])
matrix_df[100:150]

     word        match                              score
100  ObjectiveC  ObjectiveC                           100
101  ObjectiveC  ReactNative                           57
102  ObjectiveC  AutomatedTestingProjectManagement     45
103  ObjectiveC  Ionic                                 40
104  ObjectiveC  Sketch                                38
105  HTML        HTML                                 100
106  HTML        HTMLJava                              90
107  HTML        Control                               45
108  HTML        GitManualQA                           45
109  HTML        PostgreSQLManagementHosting           45
110  HTMLJava    HTMLJava                             100
111  HTMLJava    HTML                                  90
112  HTMLJava    JavaJavaScript                        45

Why is this happening, and how can I fix it?

Thank you!


Solution

  • The function process.extract in fuzzywuzzy has the following signature:

    def extract(query, choices, processor=default_processor, scorer=default_scorer, limit=5):

    Here limit defaults to 5, which means the function returns only the 5 best matches from choices (fewer when choices has fewer than 5 elements). That is why each of the 21 strings in the first item produces exactly 5 rows, and the next item starts at index 21 * 5 = 105. To get the scores for all elements, pass limit=None:

    matrix = [
      (x,) + i for item in output
      for x in item
      for i in process.extract(x, item, scorer=fuzz.partial_ratio, limit=None)
    ]
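
As a rough sanity check on the row counts, here is a small sketch that uses only the standard library. The similarity helper is a hypothetical stand-in built on difflib, not fuzz.partial_ratio, so its scores will differ from fuzzywuzzy's; only the counts matter here. It assumes that limit=None yields one row per element of choices, including the query string itself. If only the k * (k-1) / 2 unordered pairs from the question are wanted, itertools.combinations produces them directly:

```python
from difflib import SequenceMatcher
from itertools import combinations

# First item from the question (21 strings).
item = [
    "Java", "JavaVersio", "Control", "GitTools", "Sketch", "IVision", "Zepli",
    "Go", "GoAutomatedTesting", "AutomatedTestingProjectManagement",
    "AgileMethodology", "ScrumEnglish", "Writte", "English", "Spoke",
    "EnglishMobile", "ReactNative", "Ionic", "Android", "Kotlin", "ObjectiveC",
]


def similarity(a, b):
    """Hypothetical stand-in scorer (0-100) using difflib; NOT fuzz.partial_ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


k = len(item)

# Default limit=5: at most 5 rows per query string.
rows_with_default_limit = k * min(5, k)

# Assumed behaviour with limit=None: one row per (query, choice),
# including each string matched against itself.
rows_with_limit_none = k * k

# Each unordered pair exactly once, with no self-matches.
unique_pairs = [(a, b, similarity(a, b)) for a, b in combinations(item, 2)]

print(rows_with_default_limit, rows_with_limit_none, len(unique_pairs))
# prints: 105 441 210
```

So with limit=None the first item alone would contribute 441 rows rather than 210, because each pair appears in both orders and every string is also matched against itself. Filtering those out afterwards, or iterating over combinations as above, gives the 210 comparisons the question expects.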