
Pre-Processing / Formatting Data


I have two vectors in R:

     list1 <- c("ABCDEF", "FEDCBA", "AA-BB-CCCC", "ABCDEFGH-IJK", "ZZZZ")
     
     list2 <- c("ABCDEF", "FEDCBA:XA",
                "AA-BB-CCCC-01", "AA-BB-CCCC-21:ABC", "ABCDEFGH-IJK-1X",
                "AKDWXFE-XXY")

I'd like to compare the two vectors, with list1 being the 'correct' one. If an item in list1 does not appear in list2, print 'Add [item in list1]'; if an item in list2 is not in list1, print 'delete [item in list2]'. Partial matches should count. For example, list1 has 'FEDCBA' and list2 has 'FEDCBA:XA', which is an acceptable partial match. Likewise, list2 has 'AA-BB-CCCC-21:ABC' while list1 has 'AA-BB-CCCC', which is also an acceptable partial match.


Solution

  • This looks like homework to me, but OK, let us make it a teaching moment.

    First, let us find out which elements of list1 have matches in list2. We will use grepl for that: given one element of list1 as the pattern, it returns a logical vector with one TRUE/FALSE value for each element of list2.

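    library(tidyverse)

    # TRUE for each element of list1 found inside at least one element of list2
    list1_has_match <- map_lgl(list1, ~ any(grepl(.x, list2)))
    # Elements of list1 without any match are missing from list2
    msg <- sprintf("Add [%s]", list1[!list1_has_match])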
    

    In the above code, I use map_lgl to run the any(grepl(...)) expression for each element of list1 and return a logical vector. Any element that has a FALSE value in that vector is not present in list2 and should be added.
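
    To make the building block concrete, here is a small base R sketch of what a single grepl call returns for one pattern and how any collapses it. The variable name hits is only illustrative and does not appear in the answer's code.

    ```r
    list2 <- c("ABCDEF", "FEDCBA:XA",
               "AA-BB-CCCC-01", "AA-BB-CCCC-21:ABC", "ABCDEFGH-IJK-1X",
               "AKDWXFE-XXY")

    # One logical value per element of list2: does it contain "AA-BB-CCCC"?
    hits <- grepl("AA-BB-CCCC", list2)
    hits
    #> [1] FALSE FALSE  TRUE  TRUE FALSE FALSE

    # any() reduces the vector to a single answer for this pattern
    any(hits)
    #> [1] TRUE
    ```

    map_lgl simply repeats this for every element of list1 and collects the single TRUE/FALSE results.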

    Next, we do the same the other way around. However, we still have to use the elements of list1 as the patterns, which makes this step a bit more complicated. Each call within map_dfr generates a named logical vector corresponding to one element of list1. Because we use map_dfr, each of these vectors becomes a row of a data frame, so the columns of the result correspond to the elements of list2.

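    # One row per element of list1, one column per element of list2
    map1 <- map_dfr(list1, ~ set_names(grepl(.x, list2), list2))

    # A column with no TRUE is an element of list2 that matched nothing in list1
    list2_has_match <- map_lgl(map1, any)
    msg <- c(msg,
             sprintf("delete [%s]", list2[!list2_has_match]))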
    

    And now print the messages:

    cat(msg, sep="\n")
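
    As a possible alternative to the two purrr steps, both checks can be read off a single logical matrix in base R. This is only a sketch under a few assumptions of mine: fixed = TRUE is added so that the elements of list1 are matched as literal substrings rather than regular expressions, the name hits is illustrative, and list2 is assumed to have at least two elements so that vapply returns a matrix.

    ```r
    list1 <- c("ABCDEF", "FEDCBA", "AA-BB-CCCC", "ABCDEFGH-IJK", "ZZZZ")
    list2 <- c("ABCDEF", "FEDCBA:XA",
               "AA-BB-CCCC-01", "AA-BB-CCCC-21:ABC", "ABCDEFGH-IJK-1X",
               "AKDWXFE-XXY")

    # hits[i, j] is TRUE when list2[i] contains list1[j] as a literal substring
    # (fixed = TRUE is an assumption; the answer above uses regex matching)
    hits <- vapply(list1,
                   function(p) grepl(p, list2, fixed = TRUE),
                   logical(length(list2)))

    list1_has_match <- colSums(hits) > 0  # columns correspond to list1
    list2_has_match <- rowSums(hits) > 0  # rows correspond to list2

    msg <- c(sprintf("Add [%s]", list1[!list1_has_match]),
             sprintf("delete [%s]", list2[!list2_has_match]))

    cat(msg, sep = "\n")
    #> Add [ZZZZ]
    #> delete [AKDWXFE-XXY]
    ```

    For the vectors in the question, this prints the same two messages as the purrr version, since none of the patterns contain regex metacharacters that would change the match.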