Search code examples
regexrdataframedata.tablesql-like

Find partial occurences in data frame based on a vector


I've got a dataframe a and a vector b (derived from another data frame). Now I want to find all occurences from vector b in a.
However, unfortunately vector b sometimes misses a leading character.

a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")

> a
#      IDENTITY_CODE    PIG_ID SEX_CODE
#1          01234567 129287133        Z
#2          65461341 120561144        Z
#3            NH1497 119265685        Z
#4            ZH0080 121883198        Z
#5            TP5146 109371743        B
#6            TP5146 109371743        B

b <- c("65461341", "1234567", "ZH0080", "TP5146")

My expected output would be this:

a
#  GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#1          01234567 129287133        Z
#2          65461341 120561144        Z
#4            ZH0080 121883198        Z
#5            TP5146 109371743        B

When first removing the duplicates it solves one problem, however I still need a way to select all rows containing the values from vector b whereas I need more rows:

a <- a[!duplicated(a$GSN_IDENTITY_CODE),]

Unfortunately I cannot use %in% because it will bring in duplicates and miss out on the first line because it does not accept regex':

> a[a$GSN_IDENTITY_CODE %in% b,]
#  GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#2          65461341 120561144        Z
#4            ZH0080 121883198        Z
#5            TP5146 109371743        B
#6            TP5146 109371743        B

Using data.table's %like% would work only for the first string in vector b

library(data.table)
> setDT(a)
> a[a$GSN_IDENTITY_CODE %like% b,]
#   GSN_IDENTITY_CODE    PIG_ID SEX_CODE
#1:          65461341 120561144        Z
Warning message:
In grepl(pattern, vector) :
  argument 'pattern' has length > 1 and only the first element will be used

Is there a function in R that supports my needs here?


@Frank's attempt yields the following error:

a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")

b <- c("65461341", "1234567", "ZH0080", "TP5146")

> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in `[.data.frame`(a, .(b), on = "GSN_IDENTITY_CODE", nomatch = FALSE,  : 
  unused arguments (on = "GSN_IDENTITY_CODE", nomatch = FALSE, mult = "first")
> setDT(a)
> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch,  : 
  x.'GSN_IDENTITY_CODE' is a character column being joined to i.'NA' which is type 'NULL'. Character columns must join to factor or character columns.

Solution

  • You can do something like this for close matches if the extra character might occur anywhere in the string:

    library(stringdist)
    library(purrr)
    
    
    a$closest_match <- map(a$GSN_IDENTITY_CODE, ~stringdist(., b, method = "lv")) %>% 
      map_dbl(min)
    a[a$closest_match < 2, ]
    

    If the extra character is always at the beginning, I would do something like this:

    library(stringr)
    
    a$stripped_code <- str_replace(a$GSN_IDENTITY_CODE,"^\\d", "")
    
    a$keep <- a$GSN_IDENTITY_CODE %in% b | a$stripped_code %in% b
    a[a$keep, ]