Determine which elements of a vector partially match a second vector, and which elements don't (in R)


I have a vector A, which contains a list of genera, which I want to use to subset a second vector, B. I have successfully used grepl to extract anything from B that has a partial match to the genera in A. Below is a reproducible example of what I have done.

But now I would like to get a list of which genera in A matched something in B, and which genera did not. That is, the "matched" list would contain Cortinarius and Russula, and the "unmatched" list would contain Laccaria and Inocybe. Any ideas on how to do this? In reality my vectors are very long, and the genus names in B do not always appear in the same position relative to the other information.

# create some dummy vectors
A <- c("Cortinarius","Laccaria","Inocybe","Russula")
B <- c("fafsdf_Cortinarius_sdfsdf","sdfsdf_Russula_sdfsdf_fdf","Tomentella_sdfsdf","sdfas_Sebacina","sdfsf_Clavulina_sdfdsf")

# extract the elements of B that have a partial match to anything in A.
new.B <- B[grepl(paste(A,collapse="|"), B)]

# But now how do I tell which elements of A were present in B, and which ones were not?

Solution

  • We could use lapply (or sapply) to loop over the patterns in A, run grep against B for each one, and name the output by pattern

    out <- setNames(lapply(A, function(x) grep(x, B, value = TRUE)), A)
    

    Then it is easy to separate the patterns that matched something from the ones that returned empty elements

    > out[lengths(out) > 0]
    $Cortinarius
    [1] "fafsdf_Cortinarius_sdfsdf"
    
    $Russula
    [1] "sdfsdf_Russula_sdfsdf_fdf"
    
    > out[lengths(out) == 0]
    $Laccaria
    character(0)
    
    $Inocybe
    character(0)
    

    and get the names of each subset, which gives the "matched" and "unmatched" genera

    > names(out[lengths(out) > 0])
    [1] "Cortinarius" "Russula"    
    > names(out[lengths(out) == 0])
    [1] "Laccaria" "Inocybe"
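
If only the matched and unmatched names are needed, and not the matching elements of B themselves, a shorter variant is possible. The sketch below is an illustration rather than part of the original answer: it uses vapply (a type-checked relative of the sapply mentioned above) with grepl, and it assumes the genus names should be treated as literal substrings, hence fixed = TRUE. The names hit, matched and unmatched are made up for this example.

```r
# Dummy vectors from the question
A <- c("Cortinarius", "Laccaria", "Inocybe", "Russula")
B <- c("fafsdf_Cortinarius_sdfsdf", "sdfsdf_Russula_sdfsdf_fdf",
       "Tomentella_sdfsdf", "sdfas_Sebacina", "sdfsf_Clavulina_sdfdsf")

# For each genus in A: does it occur anywhere in any element of B?
# fixed = TRUE is an assumption: it matches the name literally, not as a regex.
# vapply keeps the elements of A as names because A is a character vector.
hit <- vapply(A, function(x) any(grepl(x, B, fixed = TRUE)), logical(1))

matched   <- names(hit)[hit]    # genera found in B
unmatched <- names(hit)[!hit]   # genera not found in B
```

For very long vectors this still runs one grepl call per genus, so the cost grows with length(A) times length(B); fixed = TRUE usually makes each call cheaper than a regular-expression match.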