I have two lists of character vectors holding people's names. Each vector in the first list has only a first and a last name; each vector in the second list has first, middle, and last names. I need to match the two lists to find people who appear in both. Because the names are not in a consistent order (some vectors start with the first name, others with the last name), I would like to match them by finding which vector in the second list (full name) contains all the values of a vector in the first list (first and last names only).
What I have done so far:
#reproducible example
first_last_names_list <- list(c("boy", "boy"),
                              c("bob", "orengo"),
                              c("kalonzo", "musyoka"),
                              c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"))
First, I tried to make a function that checks whether one vector is contained in another vector (heavily based on the code from here).
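my_contain <- function(values, x) {
  tx <- table(x)        # counts of each name in the shorter vector
  tv <- table(values)   # counts of each name in the longer vector
  z <- tv[names(tx)] - tx   # NA where a name in x is missing from values
  if (all(z >= 0 & !is.na(z))) {
    paste(x, collapse = " ")
  }
}
#values would be the longer vector (from full_names_list)
#and x would be the shorter vector (from first_last_names_list)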
Then, I tried to put this function within sapply() so that I can work with lists and that's where I got stuck. I can get it to see whether one vector is contained within a list of vectors, but I'm not sure how to check all the vectors in one list and see if it is contained within any of the vectors from a second list.
#testing with the first vector from first_last_names_list.
#Need to make it run through all the vectors from first_last_names_list.
sapply(seq_along(full_names_list),
       function(i) any(my_contain(full_names_list[[i]],
                                  first_last_names_list[[1]]) ==
                         paste(first_last_names_list[[1]], collapse = " ")))
#[1] TRUE FALSE FALSE FALSE
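One way to extend the check above from a single vector to every vector in first_last_names_list is to nest two apply-style loops. The following is a minimal sketch, not a definitive solution: contains_all and hits are hypothetical names introduced here, and the helper returns TRUE/FALSE rather than a pasted string so that it can be used directly inside vapply().

```r
# Hypothetical helper: TRUE when every value in `short` occurs in `long`
# at least as many times (so c("boy", "boy") needs two "boy" entries).
contains_all <- function(long, short) {
  need <- table(short)
  have <- as.vector(table(long)[names(need)])  # NA where a name is missing
  all(!is.na(have) & have >= as.vector(need))
}

first_last_names_list <- list(c("boy", "boy"), c("bob", "orengo"),
                              c("kalonzo", "musyoka"), c("anami", "lisamula"))
full_names_list <- list(c("boy", "juma", "boy"),
                        c("stephen", "kalonzo", "musyoka"),
                        c("james", "bob", "orengo"),
                        c("lisamula", "silverse", "anami"))

# For each short name, the indices of the full names that contain it.
hits <- lapply(first_last_names_list, function(short) {
  which(vapply(full_names_list, contains_all, logical(1), short = short))
})
names(hits) <- vapply(first_last_names_list, paste, character(1), collapse = " ")
hits
```

Each element of hits holds positions in full_names_list, so an empty element means no match and an element with several positions means the short name is ambiguous.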
Lastly, although it might be too much to ask in one question, if anyone could give me any pointers on how to incorporate agrep() for fuzzy matching to account for typos in the names, that would be great! If not, that's okay too, since I want to get at least the matching part right first.
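As a rough illustration of the fuzzy-matching idea, the sketch below swaps agrep() for adist(), which is also part of a standard R installation. agrep() looks for approximate substring matches, whereas adist() returns whole-string edit distances, which is arguably closer to "same name with a typo". The helper name fuzzy_contains and the max_dist threshold are assumptions made for this example, and the pairing is greedy, so it may miss a valid assignment in unusual cases.

```r
# Hypothetical helper: TRUE when each name in `short` can be paired with a
# distinct name in `long` that is within `max_dist` edits of it.
fuzzy_contains <- function(long, short, max_dist = 1) {
  d <- adist(short, long)          # rows: short names, columns: long names
  used <- logical(length(long))    # long names already paired
  for (i in seq_along(short)) {
    ok <- which(d[i, ] <= max_dist & !used)
    if (length(ok) == 0) return(FALSE)
    used[ok[which.min(d[i, ok])]] <- TRUE   # take the closest free candidate
  }
  TRUE
}

# "kalonso" is one substitution away from "kalonzo"
fuzzy_contains(c("stephen", "kalonzo", "musyoka"), c("kalonso", "musyoka"))
fuzzy_contains(c("james", "bob", "orengo"), c("kalonso", "musyoka"))
```

A small max_dist keeps short names such as "bob" from matching unrelated short names; the right threshold depends on the data.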
Since you are dealing with lists, it is easier to collapse each vector into a single string so that you can use regular expressions. Sort the names within each vector first, so that both lists use the same order; after that, matching them is straightforward:
lst <- sapply(first_last_names_list, function(x) paste0(sort(x), collapse = " "))
lst1 <- gsub("\\s|$", ".*", lst)
lst2 <- sapply(full_names_list, function(x) paste(sort(x), collapse = " "))
(lst3 <- Vectorize(grep)(lst1, list(lst2), value = TRUE, ignore.case = TRUE))
boy.*boy.* bob.*orengo.* kalonzo.*musyoka.* anami.*lisamula.*
"boy boy juma" "bob james orengo" "kalonzo musyoka stephen" "anami lisamula silverse"
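One caveat with patterns of the form bob.*orengo.* is that they match substrings, so "bob" would also match inside a longer name. A possible refinement, sketched here with hypothetical variable names (short, loose, strict), is to wrap each name in word boundaries:

```r
short <- c("bob", "orengo")

# Pattern in the style used above: names joined by ".*"
loose <- paste0(sort(short), collapse = ".*")
# Same idea, but each name must match as a whole word
strict <- paste0("\\b", sort(short), "\\b", collapse = ".*")

grepl(loose,  "bobby james orengo")               # TRUE: "bob" matches inside "bobby"
grepl(strict, "bobby james orengo", perl = TRUE)  # FALSE
grepl(strict, "bob james orengo",   perl = TRUE)  # TRUE
```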
Now, if you want to link first_last_names_list and full_names_list:
setNames(full_names_list[ match(lst3,lst2)],sapply(first_last_names_list[grep(paste0(names(lst3),collapse = "|"),lst1)],paste,collapse=" "))
$`boy boy`
[1] "boy" "juma" "boy"
$`bob orengo`
[1] "james" "bob" "orengo"
$`kalonzo musyoka`
[1] "stephen" "kalonzo" "musyoka"
$`anami lisamula`
[1] "lisamula" "silverse" "anami"
where the names come from first_last_names_list and the elements come from full_names_list. It would be easier to work with character vectors rather than lists: