
Fuzzy comparison of names in R: how to find the highest possible sum in a matrix (with boundary conditions)


I have a matrix with similarity scores that looks like this:

[Image: comparison matrix]

I need to find the highest possible sum of scores in this matrix. The sums have to fulfill a condition though:

If a number has been used for the sum, no number from its row or column, or from any previous row or column, can be used for that sum anymore. This is because the order of the names matters.

I can start at any number, but all the values to the left and above that value will then be disqualified for the rest of that sum, as well as the values in the same row and column.

The highest possible sum with this system is 130 (10 + 100 + 10 + 10). That's the number I want in the end.

My strategy at the moment is to calculate all possible sums and then simply select the highest. But how do I code the condition that I described above? Does it make sense?

Here are more examples of allowed (green) and not allowed (red) sums:

[Four example images, not reproduced here]

Another example of a valid sum:

  1. I start at the upper left corner. I choose 10. I cannot add the 12 or the 11 that are in the same column anymore.
  2. I choose one of the remaining numbers: 12. I can no longer choose 100, 11, 10, or 25 from that column and row (and the previous columns and rows). I can now only choose 22 or 10 for the last number to add.
  3. If I pick 22, my total sum is 44. If I pick 10, my total sum is 32.
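
Put differently, a selection is allowed when the row indices and the column indices of the picked cells are both strictly increasing in picking order. Below is a minimal sketch of that check. It assumes the matrix values match `testmat` further down, and `is_valid_selection` and `selection_sum` are hypothetical helper names, not part of the code in this post.

```r
# Score matrix, typed out by hand (assumed to match the image above).
scores <- matrix(c(10,   0,  0,  0,
                   12, 100, 12, 25,
                   11,  11, 10, 22,
                    0,   0,  0, 10),
                 nrow = 4, byrow = TRUE)

# Hypothetical helper: `rows` and `cols` hold the picked cells in picking order.
# The rule holds when both index vectors are strictly increasing.
is_valid_selection <- function(rows, cols) {
  length(rows) == length(cols) && all(diff(rows) > 0) && all(diff(cols) > 0)
}

# Hypothetical helper: sum of the picked cells, or NA if the selection breaks the rule.
selection_sum <- function(m, rows, cols) {
  if (!is_valid_selection(rows, cols)) return(NA_real_)
  sum(m[cbind(rows, cols)])
}

selection_sum(scores, c(1, 2, 3), c(1, 3, 4))  # 10 + 12 + 22 = 44
selection_sum(scores, c(1, 2, 4), c(1, 3, 4))  # 10 + 12 + 10 = 32
selection_sum(scores, c(2, 1), c(2, 1))        # NA: second pick lies above and to the left
```

The first two calls reproduce the two totals from step 3.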

The reason I'm using this system is that I'm trying to create an algorithm that compares the full names of people and assigns a probability that they're the same person, based purely on name information.

My current code looks like this:

library(tidyverse)
library(stringdist)

string.compare <- function(Var1, Var2){
  
  string1 <- Var1 %>% tolower() %>% trimws() %>% str_replace_all(pattern = "[[:punct:]]", replacement = "")  %>% strsplit(" ") %>% unlist()
  string2 <- Var2 %>% tolower() %>% trimws() %>% str_replace_all(pattern = "[[:punct:]]", replacement = "")  %>% strsplit(" ") %>% unlist()
  
  compare <- array(NA, dim = c(length(string1), 
                               length(string2)), dimnames = list(string1, 
                                                                 string2))
  compare[] <- do.call(mapply, 
                      c(list(FUN = string.score),
                        expand.grid(dimnames(compare), stringsAsFactors = FALSE)))
  
  sums <- func_calc_sums(compare)  # This is where I need help. How do I write this function?
  
  return(max(sums))
}

string.score <- function(Var1, Var2){
  phonetic.weight <- 50 # this is an important variable. it determines the weight of the phonetic comparison. 100 = no weight, 0 = phonetic is all that matters.
  
  if(is.null(Var1) || is.null(Var2) || is.na(Var1) || is.na(Var2) || Var1 == "" || Var2 == ""){ # if one of the entries is empty, score 0 (|| short-circuits, so a NULL input no longer breaks the if)
    return(0)
  } else if(Var1 == substr(Var2, 1, 1)){ # if Var1 is an abbreviation of Var2, score 10
    return(10)
  } else if(nchar(Var1) == 1){ # if Var1 is an abbreviation but not of Var2, score 0
    return(0)
  } else if(Var2 == substr(Var1, 1, 1)){ # if Var2 is an abbreviation of Var1, score 10
    return(10)
  } else if(phonetic(Var1) == phonetic(Var2)){ # If Var1 and Var2 are phonetically similar, give score based on stringdist
    return(round(100 - (phonetic.weight * stringdist(Var1, Var2, method = "osa") / nchar(Var1)), 0))
  } else {  # If Var1 and Var2 are not phonetically similar, give a score based on stringdist but lower
    return(round(100 - (100 * stringdist(Var1, Var2, method = "osa") / nchar(Var1)), 0))
  }
}

If you enter, for example, Var1 <- " a. michelle hernandes s. " and Var2 <- " Alexa michelle h. sanchez" and then run them through the function string.compare (the function is not finished, so you'll have to execute the code line by line), it will first clean up the strings and then split them into separate words.

Those words become the rownames and colnames of a matrix, and the scoring function string.score is applied to every pair of words. The result is the matrix shown at the beginning of this post.
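
To make the two steps concrete, here is a small base-R sketch of the cleaning and matrix-building stage. `clean_name` is a hypothetical helper, and `toy_score` is only a placeholder scorer (100 for identical words, 0 otherwise) so that the example runs without the stringdist package. It is not the string.score function above.

```r
# Hypothetical helper: lower-case, trim, strip punctuation, split on whitespace.
clean_name <- function(x) {
  x <- gsub("[[:punct:]]", "", trimws(tolower(x)))
  strsplit(trimws(x), "\\s+")[[1]]
}

# Placeholder scorer (an assumption for illustration only).
toy_score <- function(a, b) if (a == b) 100 else 0

words1 <- clean_name(" a. michelle hernandes s. ")
words2 <- clean_name(" Alexa michelle h. sanchez")

# Apply the scorer to every pair of words; rows come from Var1, columns from Var2.
compare <- outer(words1, words2, Vectorize(toy_score))
dimnames(compare) <- list(words1, words2)
compare
```

Swapping `toy_score` for string.score should give the real score matrix, since both take two single words and return one number.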


Solution

  • I have two suggestions that might be helpful:

    1. There is an R implementation of the Needleman-Wunsch algorithm, which was mentioned by Roman Cheplyaka, on GitHub. You can find it here: https://gist.github.com/juliuskittler/ed53696ac1e590b413aac2dddf0457f6
    2. You could try to solve the problem recursively, for example with the maximum path sum function described here: https://lucidmanager.org/data-science/project-euler-18/

    That function does not know about the constraints you mentioned, so I think you would have to add them explicitly to exclude the paths that are not eligible. Here's a run of the unmodified function on your dataset:

    testmat <- matrix(data = c(10,   0,  0,  0,
                               12, 100, 12, 25,
                               11,  11, 10, 22,
                                0,   0,  0, 10),
                      ncol = 4, 
                      nrow = 4,
                      byrow = TRUE)
    
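    # Bottom-up maximum path sum; "rij" and "kol" are Dutch for row and column.
    path.sum <- function(triangle) {
      for (rij in nrow(triangle):2) {
        for (kol in 1:(ncol(triangle)-1)) {
          # add the larger of the two cells below (same column or one to the right)
          triangle[rij - 1,kol] <- max(triangle[rij,kol:(kol + 1)]) + triangle[rij - 1, kol]
        }
        triangle[rij,] <- NA  # this row has been folded into the row above
      }
      return(max(triangle, na.rm = TRUE))
    }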
    
    > path.sum(testmat)
    [1] 130
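
    path.sum may step straight down within a column, which the rule in the question forbids, so it only happens to land on 130 here. Below is a minimal sketch of a variant with the rule built in, assuming the rule means that each pick must lie strictly below and strictly to the right of the previous one. It is a dynamic program in the spirit of Needleman-Wunsch with no gap penalty. `max_ordered_sum` is a hypothetical name, not taken from either link.

```r
testmat <- matrix(c(10,   0,  0,  0,
                    12, 100, 12, 25,
                    11,  11, 10, 22,
                     0,   0,  0, 10),
                  nrow = 4, byrow = TRUE)

# Hypothetical function: highest sum over picks whose rows and columns
# both strictly increase from one pick to the next.
max_ordered_sum <- function(m) {
  n <- nrow(m)
  p <- ncol(m)
  # best[i + 1, j + 1] is the highest valid sum using only rows 1..i and columns 1..j.
  # The extra first row and column of zeros stand for "nothing picked yet".
  best <- matrix(0, n + 1, p + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(p)) {
      best[i + 1, j + 1] <- max(
        best[i, j + 1],       # do not use row i
        best[i + 1, j],       # do not use column j
        best[i, j] + m[i, j]  # pick cell (i, j) on top of a sum from rows < i, columns < j
      )
    }
  }
  best[n + 1, p + 1]
}

max_ordered_sum(testmat)
```

    On testmat this returns 132 rather than 130. Under the rule as stated, 10 + 100 + 22 is also allowed, because the 22 sits in a later row and a later column than the 100. If that pick should not count, the rule needs an extra condition that this sketch does not model.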