I've got a dataframe a
and a vector b
(derived from another data frame). Now I want to find all occurences from vector b
in a
.
However, unfortunately vector b
sometimes misses a leading character.
a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")
> a
# IDENTITY_CODE PIG_ID SEX_CODE
#1 01234567 129287133 Z
#2 65461341 120561144 Z
#3 NH1497 119265685 Z
#4 ZH0080 121883198 Z
#5 TP5146 109371743 B
#6 TP5146 109371743 B
b <- c("65461341", "1234567", "ZH0080", "TP5146")
My expected output would be this:
a
# GSN_IDENTITY_CODE PIG_ID SEX_CODE
#1 01234567 129287133 Z
#2 65461341 120561144 Z
#4 ZH0080 121883198 Z
#5 TP5146 109371743 B
When first removing the duplicates it solves one problem, however I still need a way to select all rows containing the values from vector b
whereas I need more rows:
a <- a[!duplicated(a$GSN_IDENTITY_CODE),]
Unfortunately I cannot use %in%
because it will bring in duplicates and miss out on the first line because it does not accept regex':
> a[a$GSN_IDENTITY_CODE %in% b,]
# GSN_IDENTITY_CODE PIG_ID SEX_CODE
#2 65461341 120561144 Z
#4 ZH0080 121883198 Z
#5 TP5146 109371743 B
#6 TP5146 109371743 B
Using data.table
's %like%
would work only for the first string in vector b
library(data.table)
> setDT(a)
> a[a$GSN_IDENTITY_CODE %like% b,]
# GSN_IDENTITY_CODE PIG_ID SEX_CODE
#1: 65461341 120561144 Z
Warning message:
In grepl(pattern, vector) :
argument 'pattern' has length > 1 and only the first element will be used
Is there a function in R that supports my needs here?
@Frank's attempt yields the following error:
a <- structure(list(GSN_IDENTITY_CODE = c("01234567", "65461341", "NH1497", "ZH0080", "TP5146", "TP5146"), PIG_ID = c("129287133", "120561144", "119265685", "121883198", "109371743", "109371743" ), SEX_CODE = c("Z", "Z", "Z", "Z", "B", "B")), .Names = c("GSN_IDENTITY_CODE", "PIG_ID", "SEX_CODE"), row.names = c(NA, 6L), class = "data.frame")
b <- c("65461341", "1234567", "ZH0080", "TP5146")
> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in `[.data.frame`(a, .(b), on = "GSN_IDENTITY_CODE", nomatch = FALSE, :
unused arguments (on = "GSN_IDENTITY_CODE", nomatch = FALSE, mult = "first")
> setDT(a)
> a[.(b), on="GSN_IDENTITY_CODE", nomatch=FALSE, mult="first"]
Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, :
x.'GSN_IDENTITY_CODE' is a character column being joined to i.'NA' which is type 'NULL'. Character columns must join to factor or character columns.
You can do something like this for close matches if the extra character might occur anywhere in the string:
library(stringdist)
library(purrr)
a$closest_match <- map(a$GSN_IDENTITY_CODE, ~stringdist(., b, method = "lv")) %>%
map_dbl(min)
a[a$closest_match < 2, ]
If the extra character is always at the beginning, I would do something like this:
library(stringr)
a$stripped_code <- str_replace(a$GSN_IDENTITY_CODE,"^\\d", "")
a$keep <- a$GSN_IDENTITY_CODE %in% b | a$stripped_code %in% b
a[a$keep, ]