I have the following toy dataframe:
df <- data.frame(
product = c("apple", "banana", "cherry", "durian", "eggplant", "fuyu"),
ingredients = c("flour|fibre|500", "sugar|500", "505|wheat|flavouring", "fibre(500)|eggs", "wholegrainrice|sesameoil", "500|fibre|500"),
stringsAsFactors = FALSE
)
My aim is to detect whether fibre appears in a product's ingredients, count how many times it appears and extract the values used to record fibre in a product's ingredients.
For the purposes of this analysis, fibre can be represented in a product's ingredients as "fibre", "500" or "fibre(500)".
My current code is:
library(tidyverse)
fibre_strings_to_check <- c("fibre", "500", "fibre\\(500\\)")
df2 <- df %>%
mutate(
fibre_present = str_detect(ingredients, paste(fibre_strings_to_check, collapse = "|")),
fibre_count = str_count(ingredients, paste(fibre_strings_to_check, collapse = "|")),
fibre_used = str_extract_all(ingredients, paste(fibre_strings_to_check, collapse = "|"))
)
Which gives the output of df2
as:
| product | ingredients | fibre_present | fibre_count | fibre_used |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple | flour\|fibre\|500 | TRUE | 2 | fibre, 500 |
| banana | sugar\|500 | TRUE | 1 | 500 |
| cherry | 505\|wheat\|flavouring | FALSE | 0 | |
| durian | fibre(500)\|eggs | TRUE | 2 | fibre, 500 |
| eggplant | wholegrainrice\|sesameoil | FALSE | 0 | |
| fuyu | 500\|fibre\|500 | TRUE | 3 | 500, fibre, 500 |
The issue I am having is with the "durian" product. I want "fibre(500)" to be counted as one value / instance of fibre as it has been defined in fibre_strings_to_check
. But because it seems to match the other instances of fibre in fibre_strings_to_check
, it's counting as two values / instances of fibre.
My intended output of df2
is:
| product | ingredients | fibre_present | fibre_count | fibre_used |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple | flour\|fibre\|500 | TRUE | 2 | fibre, 500 |
| banana | sugar\|500 | TRUE | 1 | 500 |
| cherry | 505\|wheat\|flavouring | FALSE | 0 | |
| durian | fibre(500)\|eggs | TRUE | 1 | fibre(500) |
| eggplant | wholegrainrice\|sesameoil | FALSE | 0 | |
| fuyu | 500\|fibre\|500 | TRUE | 3 | 500, fibre, 500 |
How do I adjust the script so that there is no double counting of what is intended to be a single value?
A quick fix would be to rearrange the vector fibre_strings_to_check
so that "fibre\\(500\\)"
occurs first than rest of the values.
library(dplyr)
library(stringr)
fibre_strings_to_check <- c("fibre\\(500\\)", "fibre", "500")
fibre_regex <- paste(fibre_strings_to_check, collapse = "|")
df2 <- df %>%
mutate(
fibre_present = str_detect(ingredients, fibre_regex),
fibre_count = str_count(ingredients, fibre_regex),
fibre_used = str_extract_all(ingredients, fibre_regex)
)
df2
# product ingredients fibre_present fibre_count fibre_used
#1 apple flour|fibre|500 TRUE 2 fibre, 500
#2 banana sugar|500 TRUE 1 500
#3 cherry 505|wheat|flavouring FALSE 0
#4 durian fibre(500)|eggs TRUE 1 fibre(500)
#5 eggplant wholegrainrice|sesameoil FALSE 0
#6 fuyu 500|fibre|500 TRUE 3 500, fibre, 500