Search code examples
rstringr

How do I correctly extract my intended values using str_extract_all?


I have the following toy dataframe:

df <- data.frame(
  product = c("apple", "banana", "cherry", "durian", "eggplant", "fuyu"),
  ingredients = c("flour|fibre|500", "sugar|500", "505|wheat|flavouring", "fibre(500)|eggs", "wholegrainrice|sesameoil", "500|fibre|500"),
  stringsAsFactors = FALSE
)

My aim is to detect whether fibre appears in a product's ingredients, count how many times it appears and extract the values used to record fibre in a product's ingredients.

For the purposes of this analysis, fibre can be represented in a product's ingredients as "fibre", "500" or "fibre(500)".

My current code is:

library(tidyverse)

fibre_strings_to_check <- c("fibre", "500", "fibre\\(500\\)")

df2 <- df %>%
  mutate(
    fibre_present = str_detect(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_count = str_count(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_used = str_extract_all(ingredients, paste(fibre_strings_to_check, collapse = "|"))
  )

Which gives the output of df2 as:

| product  | ingredients                 | fibre_present | fibre_count | fibre_used        |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple    | flour\|fibre\|500           | TRUE          | 2           | fibre, 500        |
| banana   | sugar\|500                  | TRUE          | 1           | 500               |
| cherry   | 505\|wheat\|flavouring      | FALSE         | 0           |                   |
| durian   | fibre(500)\|eggs            | TRUE          | 2           | fibre, 500        |
| eggplant | wholegrainrice\|sesameoil   | FALSE         | 0           |                   |
| fuyu     | 500\|fibre\|500             | TRUE          | 3           | 500, fibre, 500   |

The issue I am having is with the "durian" product. I want "fibre(500)" to be counted as one value / instance of fibre as it has been defined in fibre_strings_to_check. But because it seems to match the other instances of fibre in fibre_strings_to_check, it's counting as two values / instances of fibre.

My intended output of df2 is:

| product  | ingredients                 | fibre_present | fibre_count | fibre_used        |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple    | flour\|fibre\|500           | TRUE          | 2           | fibre, 500        |
| banana   | sugar\|500                  | TRUE          | 1           | 500               |
| cherry   | 505\|wheat\|flavouring      | FALSE         | 0           |                   |
| durian   | fibre(500)\|eggs            | TRUE          | 1           | fibre(500)        |
| eggplant | wholegrainrice\|sesameoil   | FALSE         | 0           |                   |
| fuyu     | 500\|fibre\|500             | TRUE          | 3           | 500, fibre, 500   |

How do I adjust the script so that there is no double counting of what is intended to be a single value?


Solution

  • A quick fix would be to rearrange the vector fibre_strings_to_check so that "fibre\\(500\\)" occurs first than rest of the values.

    library(dplyr)
    library(stringr)
    
    fibre_strings_to_check <- c("fibre\\(500\\)", "fibre", "500")
    fibre_regex <- paste(fibre_strings_to_check, collapse = "|")
    
    df2 <- df %>%
      mutate(
        fibre_present = str_detect(ingredients, fibre_regex),
        fibre_count = str_count(ingredients, fibre_regex),
        fibre_used = str_extract_all(ingredients, fibre_regex)
      )
    
    df2
    #   product              ingredients fibre_present fibre_count      fibre_used
    #1    apple          flour|fibre|500          TRUE           2      fibre, 500
    #2   banana                sugar|500          TRUE           1             500
    #3   cherry     505|wheat|flavouring         FALSE           0                
    #4   durian          fibre(500)|eggs          TRUE           1      fibre(500)
    #5 eggplant wholegrainrice|sesameoil         FALSE           0                
    #6     fuyu            500|fibre|500          TRUE           3 500, fibre, 500