Text subsetting of a data frame in R


I have two single-column data frames of given names in R:

A <- data.frame(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
B <- data.frame(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

I want to compare A and B and create C with the names from B that are not in A. The comparison should ignore case, i.e. treat James and james as the same name, and if an entry contains two names (a given name and a preferred name, e.g. Lucas; Luc), a match on either one should count. In the end, the result should be:

C <- data.frame(c("Evelyn; Eva", "Harper", "Amelia"))

Can someone help me?


Solution

  • Probably the ugliest code I've written, but it works.

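    library(dplyr)    # tibble(), nest_by(), filter(), mutate(), pull(), %>%
    library(stringr)  # str_to_title(), str_extract(), str_c()

    # Normalise case so that "james" and "James" compare equal
    A <- str_to_title(c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc"))
    B <- str_to_title(c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia"))

    # Split every entry into a given name (before ";") and an optional preferred name (after ";")
    nested <- tibble(given = str_extract(c(A, B), "^[^;]+"),
                     preferred = str_extract(c(A, B), ";\\s*([^;]+)") %>% str_extract("[a-zA-Z]+"),
                     list = c(rep("A", length(A)), rep("B", length(B)))) %>%
      nest_by(list)
    A <- nested$data[[1]]
    B <- nested$data[[2]]

    # TRUE where a given name in B matches a given or preferred name in A
    in_a <- B$given %in% A$given | B$given %in% A$preferred

    # Keep the rows of B without a match and rebuild the "given; preferred" label
    B %>%
      filter(given %in% B$given[!in_a]) %>%
      mutate(c = ifelse(is.na(preferred), given, str_c(given, preferred, sep = "; "))) %>%
      pull(c)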
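
The matching rule from the question (ignore case, and count a match on either part of a "given; preferred" entry) can also be sketched in base R without extra packages. This is only one possible approach, not the method of the answer above. The helper `name_variants` and the column name `name` are made up for illustration, and the inputs are assumed to be plain character vectors.

```r
# Hypothetical helper (not part of the original answer): lower-case each entry,
# split it on ";", and trim whitespace, giving one vector of name variants per entry.
name_variants <- function(x) {
  lapply(strsplit(tolower(x), ";", fixed = TRUE), trimws)
}

A <- c("Nick", "Maria", "Liam", "Oliver", "Sophia", "james", "Lucas; Luc")
B <- c("Liam", "Luc", "Evelyn; Eva", "James", "Harper", "Amelia")

# Every name variant that occurs anywhere in A, lower-cased
known <- unlist(name_variants(A))

# Keep an entry of B only if none of its variants occurs in A
keep <- !vapply(name_variants(B), function(v) any(v %in% known), logical(1))

# Entries are returned with their original spelling and capitalisation
C <- data.frame(name = B[keep])
```

If the names are stored in single-column data frames, as in the question, the column can be pulled out first with `A[[1]]` and `B[[1]]`. Unlike the tidyverse version, this sketch also compares the preferred names in B against A.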