How do I supply agrepl with a list of strings to check?


I am trying to use agrepl to detect whether the Ingredients variable in my dataframe df contains any of a number of possible strings (food ingredients). I want to account for slight misspellings or errors. I am working in an environment where installing packages is difficult, so I am keen to use agrepl. df is a very simplified version of the actual data for illustration, and I've put the data for df at the end of this question.

These are the strings I want to check:

strings_to_check <- c("Molybdenum Salt",
                      "Mineral Salt \\(Molybdenum Sulfide\\)",
                      "Molybdenum Sulfide",
                      "Mineral Salt \\(444\\)",
                      "444")

I can detect the presence of these strings as expected with grepl:

library(dplyr)

ingredients_df <- df %>% 
  mutate(Molybdenum = grepl(paste(strings_to_check, collapse = "|"), Ingredients))

And when I use agrepl with a single string, it is also working as expected:

one_string_df <- ingredients_df %>% 
  mutate(One_String = agrepl("Molybdenum Sulfide", Ingredients, max.distance = 2, ignore.case = TRUE))

But agrepl with the full strings_to_check returns FALSE values for every case:

fuzzy_df <- ingredients_df %>% 
  mutate(Fuzzy_Molybdenum = agrepl(paste(strings_to_check, collapse = "|"), Ingredients, max.distance = 2, ignore.case = TRUE))

Given the difference between supplying a single string versus strings_to_check, I think there must be an issue with the way agrepl is using strings_to_check. How should I pass the list of strings into agrepl so it works as expected?

My expected output is:

   Product_Name                        Ingredients                 Issue Molybdenum Fuzzy_Molybdenum
   <chr>                               <chr>                       <chr> <lgl>      <lgl>           
 1 Cheesy Jalapeno Popcorn             Sugar | Croutons (10%) (Wh… Mino… FALSE      TRUE            
 2 Creamy Coconut Curry Soup           Premix [Salt | Mineral Sal… NA    TRUE       TRUE            
 3 Crunchy Cheddar Bites               Vegetable Oils (Palm | Can… NA    FALSE      FALSE
 4 Exotic Thai Basil Noodles           Natural Cheese Flavour [Ma… NA    TRUE       TRUE
 5 Golden Honey Wheat Bread            Sesame Seeds (3%) | Yeast … Lowe… FALSE      TRUE            
 6 Gourmet Truffle Macaroni & Cheese   Rice Flour | Thickener (14… NA    TRUE       TRUE            
 7 Heavenly Hazelnut Delight Ice Cream Acidity Regulator (339) | … Majo… FALSE      FALSE           
 8 Juicy Pineapple Burst Sorbet        Maltodextrin | Salt | Suga… NA    TRUE       TRUE            
 9 Maple Glazed Pecan Granola          Dried Vegetables (9%) (Pea… NA    TRUE       TRUE            
10 Mediterranean Herb Garden Hummus    Electrolytes 11.5% (Sodium… NA    FALSE      FALSE           
11 Roasted Garlic Parmesan Pretzels    Dextrose | Rice Flour | Wh… NA    FALSE      FALSE           
12 Smoky BBQ Bliss Potato Chips        Minerals (Calcium Phosphat… Mino… FALSE      TRUE            
13 Spicy Mango Tango Salsa             Maltodextrin | Filtered Wa… NA    TRUE       TRUE            
14 Sweet Cinnamon Swirl Pancakes       Bacon (15%) [Pork | Salt |… NA    FALSE      FALSE           
15 Zesty Lemonade Infusion             Onion Powder (Yeast Extrac… Repe… TRUE       TRUE 

Data for df:

structure(list(Product_Name = c("Cheesy Jalapeno Popcorn", "Creamy Coconut Curry Soup", 
"Crunchy Cheddar Bites", "Exotic Thai Basil Noodles", "Golden Honey Wheat Bread", 
"Gourmet Truffle Macaroni & Cheese", "Heavenly Hazelnut Delight Ice Cream", 
"Juicy Pineapple Burst Sorbet", "Maple Glazed Pecan Granola", 
"Mediterranean Herb Garden Hummus", "Roasted Garlic Parmesan Pretzels", 
"Smoky BBQ Bliss Potato Chips", "Spicy Mango Tango Salsa", "Sweet Cinnamon Swirl Pancakes", 
"Zesty Lemonade Infusion"), Ingredients = c("Sugar | Croutons (10%) (Wheat Flour | Vegetable Oil | Salt | Yeast) | Mineral Salt (Molybdenu Sulfide) | Salt | Natural Flavour", 
"Premix [Salt | Mineral Salts (451 | 452 | 444 | 450) | Sugar | Vegetable Gum (407a) | Flavour Enhancers (631 | 627)} | Natural Flavour", 
"Vegetable Oils (Palm | Canola) | Iodised Salt | Yellow Pea Flour", 
"Natural Cheese Flavour [Maltodextrin | Salt | Natural Flavour | Dextrose | Molybdenum Sulfide (444) | Yeast Extract]", 
"Sesame Seeds (3%) | Yeast | Yellow Pea Flour | molybdenum sulfide | Vitamins (Thiamin | Folic Acid)", 
"Rice Flour | Thickener (1412) | Salt | Molybdenum Sulfide (Natural Source) | Herbs | Mineral Salt (451) Preservative (223)", 
"Acidity Regulator (339) | Antioxidant (316) | Mylabdenu Sulfini | Colour Fixative (Sodium Nitrite)", 
"Maltodextrin | Salt | Sugar | Natural Flavours (Contains Wheat | Soy) | Dried Vegetables [Onion | Carrot] | Mineral Salt (444)", 
"Dried Vegetables (9%) (Peas | Vegetable Powder | Sugar | Mineral Salt (444) | Yeast Extract | Vegetable Oil | Herbs & Spices | Natural Colour (100)", 
"Electrolytes 11.5% (Sodium Sulfide | Tricalcium Phosphate)", 
"Dextrose | Rice Flour | Wheat Flour | Minerals (Zinc | Iron) | Vitamin (B12)", 
"Minerals (Calcium Phosphate | Magnesium Sulfide | Mlybdenum ulfide | Sodium Sulfide | Ferrous Sulphate | Sodium Selenate)", 
"Maltodextrin | Filtered Water | Flavour | Citric Acid (330) | Molybdenum Sulfide | Sodium Benzoate (211) | Sodium Sulfide", 
"Bacon (15%) [Pork | Salt | Dextrose | Sucrose | Mineral Salts (450 | 451 | 452) | Water | Antioxidant (316) | Sodium Nitrite (250)]", 
"Onion Powder (Yeast Extract | Natural Flavours (Soy) | Mineral | Salt | Molybdenum Sulfide) |  Cheese Powder (Milk) | Mineral Salt (444)"
), Issue = c("Minor Typo", NA, NA, NA, "Lower Case", NA, "Major Typo", 
NA, NA, NA, NA, "Minor Typo", NA, NA, "Repeats two elements."
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-15L))

Solution

  • agrepl() and grepl() have opposite defaults for the fixed argument (TRUE and FALSE respectively). With the default fixed = TRUE, agrepl() treats your concatenated terms as one literal string, so each | is matched as a literal character rather than as regex alternation between the terms. Pass fixed = FALSE to agrepl().

    library(dplyr)
    df %>%
      mutate(
        Fuzzy_Molybdenum = agrepl(
          paste(strings_to_check, collapse = "|"),
          Ingredients,
          max.distance = 2,
          ignore.case = TRUE,
          fixed = FALSE
        )
      )
    
    # A tibble: 15 × 4
       Product_Name                        Ingredients                            Issue Fuzzy_Molybdenum
       <chr>                               <chr>                                  <chr> <lgl>           
     1 Cheesy Jalapeno Popcorn             Sugar | Croutons (10%) (Wheat Flour |… Mino… TRUE            
     2 Creamy Coconut Curry Soup           Premix [Salt | Mineral Salts (451 | 4… NA    TRUE            
     3 Crunchy Cheddar Bites               Vegetable Oils (Palm | Canola) | Iodi… NA    FALSE           
     4 Exotic Thai Basil Noodles           Natural Cheese Flavour [Maltodextrin … NA    TRUE            
     5 Golden Honey Wheat Bread            Sesame Seeds (3%) | Yeast | Yellow Pe… Lowe… TRUE            
     6 Gourmet Truffle Macaroni & Cheese   Rice Flour | Thickener (1412) | Salt … NA    TRUE            
     7 Heavenly Hazelnut Delight Ice Cream Acidity Regulator (339) | Antioxidant… Majo… FALSE           
     8 Juicy Pineapple Burst Sorbet        Maltodextrin | Salt | Sugar | Natural… NA    TRUE            
     9 Maple Glazed Pecan Granola          Dried Vegetables (9%) (Peas | Vegetab… NA    TRUE            
    10 Mediterranean Herb Garden Hummus    Electrolytes 11.5% (Sodium Sulfide | … NA    FALSE           
    11 Roasted Garlic Parmesan Pretzels    Dextrose | Rice Flour | Wheat Flour |… NA    FALSE           
    12 Smoky BBQ Bliss Potato Chips        Minerals (Calcium Phosphate | Magnesi… Mino… TRUE            
    13 Spicy Mango Tango Salsa             Maltodextrin | Filtered Water | Flavo… NA    TRUE            
    14 Sweet Cinnamon Swirl Pancakes       Bacon (15%) [Pork | Salt | Dextrose |… NA    TRUE            
    15 Zesty Lemonade Infusion             Onion Powder (Yeast Extract | Natural… Repe… TRUE
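
As a rough base-R illustration of the point above (no extra packages needed), the sketch below first contrasts the default fixed = TRUE with fixed = FALSE, then shows one possible alternative: checking each term separately and combining the results with |. The toy ingredients vector, the split into fuzzy_terms and exact_terms, and the helpers fuzzy(), exact() and match_any() are all hypothetical names made up for this example; they are not part of the question or of base R. Matching the short code "444" exactly is an assumption: in the output above, row 14 is TRUE although the question expected FALSE, plausibly because a three-character pattern with max.distance = 2 sits within two edits of codes such as 450.

```r
# Hypothetical toy data for illustration (not the asker's df)
ingredients <- c(
  "Salt | Mineral Salt (Molybdenu Sulfide)",  # minor typo
  "Mineral Salts (450 | 451 | 452)",          # unrelated codes
  "Sugar | Mineral Salt (444)"                # exact code
)

fuzzy_terms <- c("Molybdenum Salt", "Molybdenum Sulfide")  # matched approximately
exact_terms <- c("444")                                     # assumption: match codes literally

# Part 1: the same "|"-joined pattern with and without fixed = FALSE
joined <- paste(fuzzy_terms, collapse = "|")

# Default fixed = TRUE: "|" is a literal character, so the whole joined
# string would have to occur (approximately) in each ingredient list
literal_match <- agrepl(joined, ingredients,
                        max.distance = 2, ignore.case = TRUE)

# fixed = FALSE: "|" separates alternatives in a regular expression
regex_match <- agrepl(joined, ingredients,
                      max.distance = 2, ignore.case = TRUE, fixed = FALSE)

# Part 2: one term at a time, so no regex escaping is needed
fuzzy <- function(term, x) agrepl(term, x, max.distance = 2, ignore.case = TRUE)
exact <- function(term, x) grepl(term, x, fixed = TRUE)

# TRUE where at least one term matches; starts from all FALSE
match_any <- function(terms, x, matcher) {
  Reduce(`|`, lapply(terms, matcher, x = x), rep(FALSE, length(x)))
}

any_match <- match_any(fuzzy_terms, ingredients, fuzzy) |
  match_any(exact_terms, ingredients, exact)

print(literal_match)
print(regex_match)
print(any_match)
```

Because each term is passed to agrepl() with its default fixed = TRUE, parentheses in a term would need no escaping in this variant, and each term could be given its own max.distance if some need a stricter tolerance than others.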