I am trying to use agrepl to detect whether the Ingredients variable in my dataframe df contains one of a number of possible strings (food ingredients). I want to account for slight misspellings or errors. I am working in an environment where installing packages is difficult, so I am keen to use agrepl. df is a very simplified version of the actual data for illustration, and I've put the data for df at the end of this question.
These are the strings I want to check:
strings_to_check <- c("Molybdenum Salt",
                      "Mineral Salt \\(Molybdenum Sulfide\\)",
                      "Molybdenum Sulfide",
                      "Mineral Salt \\(444\\)",
                      "444")
I can detect the presence of these strings as expected with grepl:
library(dplyr)
ingredients_df <- df %>%
  mutate(Molybdenum = grepl(paste(strings_to_check, collapse = "|"), Ingredients))
And when I use agrepl with a single string, it also works as expected:
one_string_df <- ingredients_df %>%
  mutate(One_String = agrepl("Molybdenum Sulfide", Ingredients, max.distance = 2, ignore.case = TRUE))
But agrepl with the full strings_to_check returns FALSE for every row:
fuzzy_df <- ingredients_df %>%
  mutate(Fuzzy_Molybdenum = agrepl(paste(strings_to_check, collapse = "|"), Ingredients, max.distance = 2, ignore.case = TRUE))
Given the difference between supplying a single string and supplying strings_to_check, I think there must be an issue with the way agrepl is using strings_to_check. How should I pass the list of strings into agrepl so it works as expected?
My expected output is:
Product_Name Ingredients Issue Molybdenum Fuzzy_Molybdenum
<chr> <chr> <chr> <lgl> <lgl>
1 Cheesy Jalapeno Popcorn Sugar | Croutons (10%) (Wh… Mino… FALSE TRUE
2 Creamy Coconut Curry Soup Premix [Salt | Mineral Sal… NA TRUE TRUE
3 Crunchy Cheddar Bites Vegetable Oils (Palm | Can… NA FALSE FALSE
4 Exotic Thai Basil Noodles Natural Cheese Flavour [Ma… NA TRUE TRUE
5 Golden Honey Wheat Bread Sesame Seeds (3%) | Yeast … Lowe… FALSE TRUE
6 Gourmet Truffle Macaroni & Cheese Rice Flour | Thickener (14… NA TRUE TRUE
7 Heavenly Hazelnut Delight Ice Cream Acidity Regulator (339) | … Majo… FALSE FALSE
8 Juicy Pineapple Burst Sorbet Maltodextrin | Salt | Suga… NA TRUE TRUE
9 Maple Glazed Pecan Granola Dried Vegetables (9%) (Pea… NA TRUE TRUE
10 Mediterranean Herb Garden Hummus Electrolytes 11.5% (Sodium… NA FALSE FALSE
11 Roasted Garlic Parmesan Pretzels Dextrose | Rice Flour | Wh… NA FALSE FALSE
12 Smoky BBQ Bliss Potato Chips Minerals (Calcium Phosphat… Mino… FALSE TRUE
13 Spicy Mango Tango Salsa Maltodextrin | Filtered Wa… NA TRUE TRUE
14 Sweet Cinnamon Swirl Pancakes Bacon (15%) [Pork | Salt |… NA FALSE FALSE
15 Zesty Lemonade Infusion Onion Powder (Yeast Extrac… Repe… TRUE TRUE
Data for df:
df <- structure(list(Product_Name = c("Cheesy Jalapeno Popcorn", "Creamy Coconut Curry Soup",
"Crunchy Cheddar Bites", "Exotic Thai Basil Noodles", "Golden Honey Wheat Bread",
"Gourmet Truffle Macaroni & Cheese", "Heavenly Hazelnut Delight Ice Cream",
"Juicy Pineapple Burst Sorbet", "Maple Glazed Pecan Granola",
"Mediterranean Herb Garden Hummus", "Roasted Garlic Parmesan Pretzels",
"Smoky BBQ Bliss Potato Chips", "Spicy Mango Tango Salsa", "Sweet Cinnamon Swirl Pancakes",
"Zesty Lemonade Infusion"), Ingredients = c("Sugar | Croutons (10%) (Wheat Flour | Vegetable Oil | Salt | Yeast) | Mineral Salt (Molybdenu Sulfide) | Salt | Natural Flavour",
"Premix [Salt | Mineral Salts (451 | 452 | 444 | 450) | Sugar | Vegetable Gum (407a) | Flavour Enhancers (631 | 627)} | Natural Flavour",
"Vegetable Oils (Palm | Canola) | Iodised Salt | Yellow Pea Flour",
"Natural Cheese Flavour [Maltodextrin | Salt | Natural Flavour | Dextrose | Molybdenum Sulfide (444) | Yeast Extract]",
"Sesame Seeds (3%) | Yeast | Yellow Pea Flour | molybdenum sulfide | Vitamins (Thiamin | Folic Acid)",
"Rice Flour | Thickener (1412) | Salt | Molybdenum Sulfide (Natural Source) | Herbs | Mineral Salt (451) Preservative (223)",
"Acidity Regulator (339) | Antioxidant (316) | Mylabdenu Sulfini | Colour Fixative (Sodium Nitrite)",
"Maltodextrin | Salt | Sugar | Natural Flavours (Contains Wheat | Soy) | Dried Vegetables [Onion | Carrot] | Mineral Salt (444)",
"Dried Vegetables (9%) (Peas | Vegetable Powder | Sugar | Mineral Salt (444) | Yeast Extract | Vegetable Oil | Herbs & Spices | Natural Colour (100)",
"Electrolytes 11.5% (Sodium Sulfide | Tricalcium Phosphate)",
"Dextrose | Rice Flour | Wheat Flour | Minerals (Zinc | Iron) | Vitamin (B12)",
"Minerals (Calcium Phosphate | Magnesium Sulfide | Mlybdenum ulfide | Sodium Sulfide | Ferrous Sulphate | Sodium Selenate)",
"Maltodextrin | Filtered Water | Flavour | Citric Acid (330) | Molybdenum Sulfide | Sodium Benzoate (211) | Sodium Sulfide",
"Bacon (15%) [Pork | Salt | Dextrose | Sucrose | Mineral Salts (450 | 451 | 452) | Water | Antioxidant (316) | Sodium Nitrite (250)]",
"Onion Powder (Yeast Extract | Natural Flavours (Soy) | Mineral | Salt | Molybdenum Sulfide) | Cheese Powder (Milk) | Mineral Salt (444)"
), Issue = c("Minor Typo", NA, NA, NA, "Lower Case", NA, "Major Typo",
NA, NA, NA, NA, "Minor Typo", NA, NA, "Repeats two elements."
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-15L))
The issue that's tripped you up is that agrepl() and grepl() have opposite default values for the fixed argument (TRUE and FALSE respectively). In your attempt, agrepl() searches for your concatenated terms as one literal string, "|" and backslashes included, not as a regular expression containing several alternatives. Use agrepl(fixed = FALSE).
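To see the difference in isolation, here is a minimal sketch on a made-up three-element vector (the vector x and the shortened pattern are illustrative assumptions, not the question's data). It runs the same agrepl() call twice, once with the default fixed = TRUE and once with fixed = FALSE.

```r
# Toy inputs, made up for illustration (not taken from the question's data).
x <- c("Molybdenum Sulfide", "Mineral Salt (444)", "Iodised Salt")
pattern <- paste(c("Molybdenum Sulfide", "Mineral Salt \\(444\\)"), collapse = "|")

# Default fixed = TRUE: the whole pattern, including "|" and the backslashes,
# is one long literal string, so no element is within 2 edits of it.
literal_match <- agrepl(pattern, x, max.distance = 2, ignore.case = TRUE)

# fixed = FALSE: the pattern is read as a regular expression, so "|" means "or".
regex_match <- agrepl(pattern, x, max.distance = 2, ignore.case = TRUE,
                      fixed = FALSE)

print(literal_match)
print(regex_match)
```

Applied to the data frame from the question: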
library(dplyr)

df %>%
  mutate(
    Fuzzy_Molybdenum = agrepl(
      paste(strings_to_check, collapse = "|"),
      Ingredients,
      max.distance = 2,
      ignore.case = TRUE,
      fixed = FALSE  # treat the pattern as a regex so "|" means "or"
    )
  )
# A tibble: 15 × 4
Product_Name Ingredients Issue Fuzzy_Molybdenum
<chr> <chr> <chr> <lgl>
1 Cheesy Jalapeno Popcorn Sugar | Croutons (10%) (Wheat Flour |… Mino… TRUE
2 Creamy Coconut Curry Soup Premix [Salt | Mineral Salts (451 | 4… NA TRUE
3 Crunchy Cheddar Bites Vegetable Oils (Palm | Canola) | Iodi… NA FALSE
4 Exotic Thai Basil Noodles Natural Cheese Flavour [Maltodextrin … NA TRUE
5 Golden Honey Wheat Bread Sesame Seeds (3%) | Yeast | Yellow Pe… Lowe… TRUE
6 Gourmet Truffle Macaroni & Cheese Rice Flour | Thickener (1412) | Salt … NA TRUE
7 Heavenly Hazelnut Delight Ice Cream Acidity Regulator (339) | Antioxidant… Majo… FALSE
8 Juicy Pineapple Burst Sorbet Maltodextrin | Salt | Sugar | Natural… NA TRUE
9 Maple Glazed Pecan Granola Dried Vegetables (9%) (Peas | Vegetab… NA TRUE
10 Mediterranean Herb Garden Hummus Electrolytes 11.5% (Sodium Sulfide | … NA FALSE
11 Roasted Garlic Parmesan Pretzels Dextrose | Rice Flour | Wheat Flour |… NA FALSE
12 Smoky BBQ Bliss Potato Chips Minerals (Calcium Phosphate | Magnesi… Mino… TRUE
13 Spicy Mango Tango Salsa Maltodextrin | Filtered Water | Flavo… NA TRUE
14 Sweet Cinnamon Swirl Pancakes Bacon (15%) [Pork | Salt | Dextrose |… NA TRUE
15 Zesty Lemonade Infusion Onion Powder (Yeast Extract | Natural… Repe… TRUE
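Note that row 14 comes back TRUE here, whereas the question expected FALSE. A likely reason is that max.distance = 2 is generous for a three-character term such as 444: a code like 450 is only two substitutions away. One possible workaround, sketched below, is to test each term separately and require an exact match for short terms. The helper name fuzzy_any, its arguments, and the 5-character cut-off are hypothetical choices for illustration, and the terms are written as literal strings, so no regex escaping is needed.

```r
# Literal terms: the default fixed = TRUE is used below, so "(" needs no escaping.
terms <- c("Molybdenum Salt", "Mineral Salt (Molybdenum Sulfide)",
           "Molybdenum Sulfide", "Mineral Salt (444)", "444")

# Hypothetical helper: TRUE where any term matches an element of x.
# Assumption: terms of short_len characters or fewer must match exactly;
# longer terms may differ by up to max_dist edits.
fuzzy_any <- function(x, terms, max_dist = 2, short_len = 5) {
  hits <- lapply(terms, function(term) {
    if (nchar(term) <= short_len) {
      grepl(term, x, fixed = TRUE)
    } else {
      agrepl(term, x, max.distance = max_dist, ignore.case = TRUE)
    }
  })
  Reduce(`|`, hits)  # element-wise OR across all terms
}

# Made-up test strings modelled on the question's data.
x <- c("Mineral Salt (Molybdenu Sulfide)",  # minor typo
       "Mineral Salts (450 | 451 | 452)",   # no molybdenum, but contains "450"
       "Mineral Salt (444)")                # exact code

print(agrepl("444", x, max.distance = 2))   # the short term on its own
print(fuzzy_any(x, terms))
```

Inside the pipeline this could be called as mutate(Fuzzy_Molybdenum = fuzzy_any(Ingredients, terms)). Whether an exact match is the right rule for the numeric codes depends on how those codes tend to be mistyped in the real data.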