I've made an algorithm to determine scores of matching strings from 2 dataframes in R. It will search for each row in test_ech the matching rows which their score is above 0.75 in test_data (based on the matching of 3 columns from each data frame).
Well, my code works perfectly with small data frame but I'm dealing with dataframes of 12m rows and the process will take at least 5 days to be done. So I think that if I discard "for loops" It will work but I really don't know how to do it. (and if there's extra changes that I need to do to lighten the process)
Thanks.
#score function :
library(stringdist)
score <- function(i,j)
{
s_n<-stringsim(test_ech[j,3],test_data[i,5],method = "jw",p=0.15)
s_v<-stringsim(test_ech[j,5],test_data[i,4],method = "jw",p=0.15)
s_c<-stringsim(test_ech[j,4],test_data[i,3],method = "jw",p=0.15)
return(s_n*0.6+s_v*0.25+s_c*0.15)
}
#initialize result data frame :
resultat<-data.frame(nom_AS400=character(),ville_AS400=character(),cp_AS400=character(), nom_SIRENE=character(),ville_SIRENE=character(),cp_SIRENE=character(),score=double())
#algo textmining :
system.time(for (j in 1:nrow(test_ech)) {
for (i in 1:nrow(test_data)) {
x<-score(i,j)
if (x>0.75) {
ligne<-data.frame(nom_AS400=test_ech[j,3],
ville_AS400=test_ech[j,5],
cp_AS400=test_ech[j,4],
nom_SIRENE=test_data[i,5],
ville_SIRENE=test_data[i,4],
cp_SIRENE=test_data[i,3],
score=x)
resultat<-rbind(resultat,ligne)
}
}
})
test_ech : 65k rows and test_data : 12m rows
#test_ech (5 rows)
structure(list(societe_code = c("01", "01", "01", "01", "01"),
client_code = c("00048I", "00059Z", "00070Q", "00080W", "00131L"
), client_lib = c("CFA VAUBAN", "ALLRIM SA", "ATS CULLIGAN",
"AHSSEA", "ETS BRUNEAU P"), client_cp = c("25001", "25401",
"25480", "70002", "94700"), client_ville = c("BESANCON CEDEX",
"AUDINCOURT CEDEX", "ECOLE VALENTIN", "VESOUL CEDEX", "MAISONS ALFORT"
)))
#test_data (5 rows)
structure(list(siren = c("005450093", "005450095", "005541552",
"005580501", "005620117"), siret = c("00545009300033", "00545009300041",
"00554155200039", "00558050100012", "00562011700019"), codePostalEtablissement = c("04800",
"04802", "04260", "44600", "80100"), libelleCommuneEtablissement = c("GREOUX LES BAINS",
"BAINS", "ALLOS", "SAINT NAZAIRE", "ABBEVILLE"), ref = c("PASSIONNEMENT GLAMOUR",
"GLAMOUR", "LE SYMPA SNACK", "STEF", "DUBOIS")))
the expected output is a dataframe with the 3 reference columns from test_ech with the 3 matching columns from test_data and the score which should be >0.75
I'm not sure if this completely solves your problem given the dimensions of your original data, but you can reduce your time substantially by doing it over one for
loop instead of two. You can do this because the stringsim
function accepts a single character object on one side and a vector on the other.
score_2 <- function(j)
{
s_n <- stringsim(test_ech[[j,3]], test_data[[5]], method = "jw", p = 0.15)
s_v <- stringsim(test_ech[[j,5]], test_data[[4]], method = "jw", p = 0.15)
s_c <- stringsim(test_ech[[j,4]], test_data[[3]], method = "jw", p = 0.15)
return(s_n * 0.6 + s_v * 0.25 + s_c * 0.15)
}
stringsim (test_ech[,3], test_data[,5])
resultat<-data.frame(nom_AS400=character(),ville_AS400=character(),cp_AS400=character(), nom_SIRENE=character(),ville_SIRENE=character(),cp_SIRENE=character(),score=double())
for (j in 1:nrow(test_ech)) {
x <- score_2(j)
x_75 = which(x > 0.75)
if(length(x_75) > 0){
for(i in x_75){
ligne<-data.frame(nom_AS400=test_ech[[j,3]],
ville_AS400=test_ech[[j,5]],
cp_AS400=test_ech[[j,4]],
nom_SIRENE=test_data[[i,5]],
ville_SIRENE=test_data[[i,4]],
cp_SIRENE = test_data[[i,3]],
score = x[i])
resultat<-rbind(resultat,ligne)
}
}
}
Your function, repeating both your test objects 60 times:
usuário sistema decorrido
9.59 1.43 11.12
This function, repeating both test objects 60 times:
usuário sistema decorrido
0.21 0.08 0.18
Quite a bit faster :)
(note: there is stringdistmatrix
which accepts vectors on both sides and returns a matrix, but sadly there is no stringsimmatrix
. If you can map out the difference between stringdist
and stringsim
, running stringdistmatrix
and adjusting it would probably be even faster).