Tags: r, dplyr, tidyverse, purrr, tidytext

`str_detect()` and `map()` to iterate through many string detections


My data is in the format below. (The code to create it is at the end of the question.)

df
#>   id amount description
#> 1  1     10 electricity
#> 2  2    100        rent
#> 3  3      4        fees

I would like to be able to classify the transactions (rows), based on whether certain strings are in the description.

So for example:

library(tidyverse)
df <- df %>% 
  mutate(category = ifelse(str_detect(description, "elec"), "bills", description))

which gives:

#>   id amount description category
#> 1  1     10 electricity    bills
#> 2  2    100        rent     rent
#> 3  3      4        fees     fees

I'd like to be able to define a vector of keywords and the associated categories, as below:

keywords <- c(electric = "bills",
              rent = "bills",
              fees = "misc")

What is the next step to create the `category` column with the correct labels?

Desired Output:


#>   id amount description category
#> 1  1     10 electricity    bills
#> 2  2    100        rent    bills
#> 3  3      4        fees     misc

I've tried map2_df, but I must be doing something wrong, because the code below creates three versions of the df stacked on top of each other:

categorise_transactions <- function(keyword, category){df <- df %>% 
  mutate(category = ifelse(str_detect(description, keyword), category, description))}

library(purrr)
map2_df(names(keywords), keywords, categorise_transactions)

Code for data input:

df <- data.frame(
  stringsAsFactors = FALSE,
                id = c(1L, 2L, 3L),
            amount = c(10L, 100L, 4L),
       description = c("electricity", "rent", "fees")
)
df

Solution

  • str_replace_all almost gives what you need, but it replaces only the matched substring, so "electricity" becomes "billsity":

    library(dplyr)
    library(stringr)
    
    str_replace_all(df$description, keywords)
    #[1] "billsity" "bills"    "misc" 
    

    However, as suggested by @Russ Thomas, case_when gives exactly what you need. The first condition that matches sets the category, and rows that match nothing get NA.

    library(dplyr)
    library(stringr)
    
    df %>%
      mutate(category = case_when(str_detect(description, 'electric') ~ 'bills', 
                                  str_detect(description, 'rent') ~ 'bills', 
                                  str_detect(description, 'fees') ~ 'misc'))  
    
    
    #  id amount description category
    #1  1     10 electricity    bills
    #2  2    100        rent    bills
    #3  3      4        fees     misc
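
The case_when call above hard-codes each keyword. If you would rather drive the classification from the `keywords` vector in the question, something like the following might work. It is only a sketch: `lookup_category` is a hypothetical helper name (not part of any package), it assumes the first matching keyword should win, and it returns NA when nothing matches.

```r
library(dplyr)
library(stringr)
library(purrr)

df <- data.frame(
  id = c(1L, 2L, 3L),
  amount = c(10L, 100L, 4L),
  description = c("electricity", "rent", "fees"),
  stringsAsFactors = FALSE
)

# Names are the patterns to look for, values are the categories.
keywords <- c(electric = "bills", rent = "bills", fees = "misc")

# Hypothetical helper: category of the first keyword found in a single
# description, or NA if none of the keywords match.
lookup_category <- function(description, keywords) {
  hits <- str_detect(description, names(keywords))
  if (any(hits)) unname(keywords[hits][1]) else NA_character_
}

# Apply the helper to each description; map_chr returns a character vector.
result <- df %>%
  mutate(category = map_chr(description, function(d) lookup_category(d, keywords)))

result
```

Because each description is mapped to a single value, this returns one row per transaction rather than one copy of the data frame per keyword, which is what map2_df produced in the question.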