Why can't I supply str_detect with a column name argument?


I have this toy data as df:

structure(list(Product_Name = c("Delicious Chips", "Creamy Tomato Soup", 
"Cheesy Macaroni", "Savory Meatballs", "Crispy Chicken Tenders"
), Ingredients = c("Potato Slices | Vegetable Oil | Salt | Seasoning Blend", 
"Tomatoes | Water | Cream | Onions | Salt | Spices", "Macaroni | Cheese Sauce | Milk | Butter | Salt | Pepper", 
"Ground Meat | Breadcrumbs | Onions | Garlic | Spices", "Chicken Tenders | Breading Mix | Vegetable Oil | Salt | Pepper"
)), row.names = c(NA, 5L), class = "data.frame")

Here I want to find which rows contain "Salt" in the Ingredients variable.

With library(tidyverse) loaded, I first tried df %>% str_detect(Ingredients, "Salt"), but I get Error: object 'Ingredients' not found.

But when I change it to df %>% filter(str_detect(Ingredients, "Salt")), it returns a data frame with the products matching the string.

I thought str_detect needs a character vector, or something coercible to one, and that Ingredients fit that because class(df$Ingredients) returns character. Why won't it take Ingredients as an argument, and what changes when it is wrapped in filter()?


Solution

  • Many tidyverse functions (e.g., those in dplyr) use data masking, which allows you to refer to the columns of a data frame by their unquoted names, as if they were variables in the environment. We can see this when we use dplyr::filter:

    library(dplyr)
    
    df |> 
      filter(Product_Name == "Savory Meatballs")
    #>       Product_Name                                          Ingredients
    #> 1 Savory Meatballs Ground Meat | Breadcrumbs | Onions | Garlic | Spices
    

    Here filter is looking for and using the variable "Product_Name" within df, not within your global environment.
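
    To make the masking concrete, here is a minimal sketch of why the filter() version from the question works. It uses a two-row stand-in called df_small (a hypothetical name, not the full df from the question). Inside filter(), Ingredients resolves to the column, so str_detect() simply receives a character vector, exactly as if the column had been named explicitly with $:

    ```r
    library(dplyr)
    library(stringr)

    # Hypothetical two-row stand-in for the question's df (same column names)
    df_small <- data.frame(
      Product_Name = c("Delicious Chips", "Savory Meatballs"),
      Ingredients = c(
        "Potato Slices | Vegetable Oil | Salt | Seasoning Blend",
        "Ground Meat | Breadcrumbs | Onions | Garlic | Spices"
      ),
      stringsAsFactors = FALSE
    )

    # filter() evaluates its argument with df_small's columns in scope,
    # so str_detect() is handed the Ingredients column as a character vector
    masked <- df_small |> filter(str_detect(Ingredients, "Salt"))

    # The same row selection in base R, naming the column explicitly
    explicit <- df_small[str_detect(df_small$Ingredients, "Salt"), ]
    ```

    Both masked and explicit keep only the row whose Ingredients contain "Salt"; the masking happens in filter(), not in str_detect().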

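    However, str_detect, and most of the other functions from the stringr package, do not have this capability. They evaluate their arguments as ordinary R expressions, so Ingredients has to exist as an object in the environment where the call is made. The pipe does not help either: df %>% str_detect(Ingredients, "Salt") is the same as str_detect(df, Ingredients, "Salt"), which passes the whole data frame as the string and Ingredients as the pattern. As others have noted, you can nest your str_detect call within mutate or filter to see these results. But if you want to pass Ingredients straight to str_detect, you can use the with function (more info about with on r-bloggers). This is what that looks like: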

    library(stringr)
    
    df |>
      with(str_detect(Ingredients, "Salt"))
    #> [1]  TRUE  TRUE  TRUE FALSE  TRUE
    

    with does something very similar to what those dplyr functions do behind the scenes. Rather than looking for a variable named "Ingredients" in your global environment (where it is not defined), it treats its first argument (df) as an environment of its own and looks for "Ingredients" there instead.
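
    As a rough illustration of that lookup, the sketch below spells out what with() is doing by calling eval() on the quoted expression with the data frame supplied as the place to find variables. It again uses a hypothetical two-row stand-in, df_small, and it assumes that no object named Ingredients exists in the calling environment (otherwise the last call would not fail the way the question describes):

    ```r
    library(stringr)

    # Hypothetical two-row stand-in for the question's df (same column names)
    df_small <- data.frame(
      Product_Name = c("Delicious Chips", "Savory Meatballs"),
      Ingredients = c(
        "Potato Slices | Vegetable Oil | Salt | Seasoning Blend",
        "Ground Meat | Breadcrumbs | Onions | Garlic | Spices"
      ),
      stringsAsFactors = FALSE
    )

    # with() evaluates the expression using the data frame's columns as variables
    via_with <- with(df_small, str_detect(Ingredients, "Salt"))

    # Roughly the same thing written by hand: evaluate the quoted call "inside" df_small
    via_eval <- eval(quote(str_detect(Ingredients, "Salt")), df_small)

    # Naming the column explicitly needs no special evaluation at all
    via_dollar <- str_detect(df_small$Ingredients, "Salt")

    # The piped call from the question is really str_detect(df_small, Ingredients, "Salt"):
    # Ingredients is looked up as a free-standing object, so the call errors
    piped <- tryCatch(
      df_small |> str_detect(Ingredients, "Salt"),
      error = function(e) "error"
    )
    ```

    The first three results are the same logical vector, while piped ends up as the string "error". If the goal is only the logical vector, df$Ingredients is the most direct route; with() is mainly convenient when an expression refers to several columns.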