R function to conditionally match partial strings


I have two big datasets and would like to subset them so that I can use the data. My problem is that the reference column used for subsetting does not match exactly, so I would like to match on the part of the strings that is the same.

Here is a simpler example:

ref_df <- data.frame("reference" = c("swietenia macrophylla",                       
                                     "azadirachta indica",                        
                                     "cedrela odorata",                               
                                     "ochroma pyramidale",                            
                                     "tectona grandis",                               
                                     "tamarindus indica",                             
                                     "cariniana pyriformis",                          
                                     "paquita quinata",                               
                                     "albizia saman",                                 
                                     "enterolobium cyclocarpum",                      
                                     "tapirira guianensis",                           
                                     "dipteryx oleifera"),
                     "values" = c(rnorm(12)))

tofind_df <- c("swietenia macrophylla and try try",                       
           "azadirachta indica",                        
           "tamarindus indica (bla bla)",                             
           "tara",                          
           "bla bla (paquita quinata)",                               
           "prosopis pallida",                                 
           "dipteryx oleifera")

I am trying to keep all the rows of ref_df whose name matches, even partially, an entry in tofind_df, but my attempt only matches when the strings are identical.

finale <- ref_df[ref_df$reference %in% tofind_df, ]

I also tried grepl, but I couldn't find a solution.

Ideally, finale should look like this:

                  reference       values
1     swietenia macrophylla -0.459001383    
2        azadirachta indica -0.430014486
3         tamarindus indica -0.541887328
4           paquita quinata -0.003572792
5         dipteryx oleifera -0.855659901

Please keep in mind that the real data are two big data frames, not this simplified example.


Solution

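  • grepl is vectorized over the strings being searched but uses only a single pattern, so we use sapply to run it once per reference name and keep the rows whose name appears in any element of tofind_df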

    ref_df[sapply(ref_df$reference, function(x) any(grepl(x, tofind_df))),]
    
                   reference     values
    1  swietenia macrophylla  1.4482830
    2     azadirachta indica  0.9037943
    6      tamarindus indica -0.2994678
    8        paquita quinata  0.4895183
    12     dipteryx oleifera -1.1652528
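
As a hedged variation on the approach above, here is a minimal sketch that treats each reference name as a literal string rather than a regular expression, which may matter if the real names contain characters such as "." or "(". The helper name is_contained, the vector name tofind, the set.seed call and stringsAsFactors = FALSE are assumptions added for illustration; they are not part of the original question or answer.

```r
# Reproducible copy of the example data (set.seed is an addition).
set.seed(42)
ref_df <- data.frame(
  reference = c("swietenia macrophylla", "azadirachta indica",
                "cedrela odorata", "ochroma pyramidale",
                "tectona grandis", "tamarindus indica",
                "cariniana pyriformis", "paquita quinata",
                "albizia saman", "enterolobium cyclocarpum",
                "tapirira guianensis", "dipteryx oleifera"),
  values = rnorm(12),
  stringsAsFactors = FALSE
)
tofind <- c("swietenia macrophylla and try try", "azadirachta indica",
            "tamarindus indica (bla bla)", "tara",
            "bla bla (paquita quinata)", "prosopis pallida",
            "dipteryx oleifera")

# Hypothetical helper: TRUE when `ref` occurs as a literal substring of
# at least one element of `targets`. fixed = TRUE switches off regex
# interpretation of the pattern.
is_contained <- function(ref, targets) {
  any(grepl(ref, targets, fixed = TRUE))
}

# vapply() checks that every call returns exactly one logical value.
keep <- vapply(ref_df$reference, is_contained, logical(1),
               targets = tofind, USE.NAMES = FALSE)

finale <- ref_df[keep, ]
finale$reference
```

This still calls grepl once per reference name, so the running time grows with the number of reference names times the number of strings searched. Whether that is acceptable for the real data would need to be measured.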