I have two big datasets and would like to subset one of them based on a reference column. My problem is that the values in the reference column do not match exactly, so I would like to match on the part of the strings that is the same.
Here is a simpler example:
ref_df <- data.frame("reference" = c("swietenia macrophylla",
                                     "azadirachta indica",
                                     "cedrela odorata",
                                     "ochroma pyramidale",
                                     "tectona grandis",
                                     "tamarindus indica",
                                     "cariniana pyriformis",
                                     "paquita quinata",
                                     "albizia saman",
                                     "enterolobium cyclocarpum",
                                     "tapirira guianensis",
                                     "dipteryx oleifera"),
                     "values" = c(rnorm(12)))
tofind_df <- c("swietenia macrophylla and try try",
               "azadirachta indica",
               "tamarindus indica (bla bla)",
               "tara",
               "bla bla (paquita quinata)",
               "prosopis pallida",
               "dipteryx oleifera")
I am trying to keep all the rows of ref_df whose name matches, even partially, an entry in tofind_df, but my attempt only matches when the strings are identical:
finale <- ref_df[ref_df$reference %in% tofind_df, ]
I tried with grepl as well, but I couldn't find the solution.
My ideal finale should look like this:
              reference       values
1 swietenia macrophylla -0.459001383
2    azadirachta indica -0.430014486
3     tamarindus indica -0.541887328
4       paquita quinata -0.003572792
5     dipteryx oleifera -0.855659901
Please keep in mind that my real data are two big data frames, not this simple example.
We need to use sapply to get the results from grepl for every element:
ref_df[sapply(ref_df$reference, function(x) any(grepl(x, tofind_df))),]
               reference     values
1  swietenia macrophylla  1.4482830
2     azadirachta indica  0.9037943
6      tamarindus indica -0.2994678
8        paquita quinata  0.4895183
12     dipteryx oleifera -1.1652528
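Since the real data are large, here is a possible variant, offered as a sketch rather than a benchmarked solution. It passes fixed = TRUE so that grepl treats each reference name as a literal string instead of a regular expression (this avoids surprises if a name contains characters such as ( or . and is usually faster), and it uses vapply so the result is guaranteed to be a logical vector. The name tofind and the placeholder values below are illustrative assumptions, not part of the original data.

```r
# Toy data mirroring the example above; `values` are arbitrary placeholders.
ref_df <- data.frame(
  reference = c("swietenia macrophylla", "azadirachta indica",
                "cedrela odorata", "tamarindus indica",
                "paquita quinata", "dipteryx oleifera"),
  values = seq_len(6),
  stringsAsFactors = FALSE
)

# Hypothetical name for the vector of strings to search in.
tofind <- c("swietenia macrophylla and try try", "azadirachta indica",
            "tamarindus indica (bla bla)", "tara",
            "bla bla (paquita quinata)", "prosopis pallida",
            "dipteryx oleifera")

# For each reference name, check whether it occurs as a literal substring
# of at least one entry in `tofind`.
keep <- vapply(
  ref_df$reference,
  function(x) any(grepl(x, tofind, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

# Keep only the rows whose reference name was found somewhere in `tofind`.
finale <- ref_df[keep, ]
```

Note that this still compares every reference name against every search string, so the running time grows with the product of the two lengths. If that turns out to be too slow, deduplicating both inputs with unique() before matching is a cheap first step.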