Search code examples
rfor-loopstringdist

Remove for loop from stringdist algorithm in R


I've made an algorithm to determine scores of matching strings from 2 dataframes in R. It will search for each row in test_ech the matching rows which their score is above 0.75 in test_data (based on the matching of 3 columns from each data frame).

Well, my code works perfectly with small data frame but I'm dealing with dataframes of 12m rows and the process will take at least 5 days to be done. So I think that if I discard "for loops" It will work but I really don't know how to do it. (and if there's extra changes that I need to do to lighten the process)

Thanks.

#score function :

library(stringdist)

score <- function(i,j) 
{  
s_n<-stringsim(test_ech[j,3],test_data[i,5],method = "jw",p=0.15)
s_v<-stringsim(test_ech[j,5],test_data[i,4],method = "jw",p=0.15)
s_c<-stringsim(test_ech[j,4],test_data[i,3],method = "jw",p=0.15)

 return(s_n*0.6+s_v*0.25+s_c*0.15)
}

#initialize result data frame :

resultat<-data.frame(nom_AS400=character(),ville_AS400=character(),cp_AS400=character(),                nom_SIRENE=character(),ville_SIRENE=character(),cp_SIRENE=character(),score=double())

#algo textmining :

system.time(for (j in 1:nrow(test_ech)) {

  for (i in 1:nrow(test_data)) {

    x<-score(i,j)

    if (x>0.75) {

ligne<-data.frame(nom_AS400=test_ech[j,3],
       ville_AS400=test_ech[j,5],
       cp_AS400=test_ech[j,4],
       nom_SIRENE=test_data[i,5],
       ville_SIRENE=test_data[i,4],
       cp_SIRENE=test_data[i,3],
       score=x)

      resultat<-rbind(resultat,ligne)      
    }  
  } 
})

test_ech : 65k rows and test_data : 12m rows

#test_ech (5 rows)
structure(list(societe_code = c("01", "01", "01", "01", "01"), 
    client_code = c("00048I", "00059Z", "00070Q", "00080W", "00131L"
    ), client_lib = c("CFA VAUBAN", "ALLRIM SA", "ATS CULLIGAN", 
    "AHSSEA", "ETS BRUNEAU P"), client_cp = c("25001", "25401", 
    "25480", "70002", "94700"), client_ville = c("BESANCON CEDEX", 
    "AUDINCOURT CEDEX", "ECOLE VALENTIN", "VESOUL CEDEX", "MAISONS ALFORT"
    )))

#test_data (5 rows)
structure(list(siren = c("005450093", "005450095", "005541552", 
"005580501", "005620117"), siret = c("00545009300033", "00545009300041", 
"00554155200039", "00558050100012", "00562011700019"), codePostalEtablissement = c("04800", 
"04802", "04260", "44600", "80100"), libelleCommuneEtablissement = c("GREOUX LES BAINS", 
"BAINS", "ALLOS", "SAINT NAZAIRE", "ABBEVILLE"), ref = c("PASSIONNEMENT GLAMOUR", 
"GLAMOUR", "LE SYMPA SNACK", "STEF", "DUBOIS")))

the expected output is a dataframe with the 3 reference columns from test_ech with the 3 matching columns from test_data and the score which should be >0.75

output link


Solution

  • I'm not sure if this completely solves your problem given the dimensions of your original data, but you can reduce your time substantially by doing it over one for loop instead of two. You can do this because the stringsim function accepts a single character object on one side and a vector on the other.

        score_2 <- function(j) 
    {  
      s_n <- stringsim(test_ech[[j,3]], test_data[[5]], method = "jw", p = 0.15)
      s_v <- stringsim(test_ech[[j,5]], test_data[[4]], method = "jw", p = 0.15)
      s_c <- stringsim(test_ech[[j,4]], test_data[[3]], method = "jw", p = 0.15)
    
      return(s_n * 0.6 + s_v * 0.25 + s_c * 0.15)
    }
    
        stringsim (test_ech[,3], test_data[,5])
    
        resultat<-data.frame(nom_AS400=character(),ville_AS400=character(),cp_AS400=character(),                nom_SIRENE=character(),ville_SIRENE=character(),cp_SIRENE=character(),score=double())
    
        for (j in 1:nrow(test_ech)) {
    
          x <- score_2(j)
    
          x_75 = which(x > 0.75)
    
          if(length(x_75) > 0){
            for(i in x_75){
    
             ligne<-data.frame(nom_AS400=test_ech[[j,3]],
                               ville_AS400=test_ech[[j,5]],
                               cp_AS400=test_ech[[j,4]],
                               nom_SIRENE=test_data[[i,5]],
                               ville_SIRENE=test_data[[i,4]],
                               cp_SIRENE = test_data[[i,3]],                       
                               score = x[i])
    
          resultat<-rbind(resultat,ligne)
    
        }
       }
      }
    

    Your function, repeating both your test objects 60 times:

      usuário   sistema decorrido 
         9.59      1.43     11.12 
    

    This function, repeating both test objects 60 times:

      usuário   sistema decorrido 
         0.21      0.08      0.18 
    

    Quite a bit faster :)

    (note: there is stringdistmatrix which accepts vectors on both sides and returns a matrix, but sadly there is no stringsimmatrix. If you can map out the difference between stringdist and stringsim, running stringdistmatrix and adjusting it would probably be even faster).