I have a data frame which consists of parent companies and brands. I would like to clean and recode the brands based on multiple vectors I have created for each brand that contain product models.
#Example dataframe
companies <- c("comp1","comp2","comp3", "comp4")
brands <- c("brand1", "brand2", "brand3", "brand4")
companies_brands <- cbind(companies, brands)
companies_brands <- data.frame(companies_brands)
#output
#Rows: 4
#Columns: 2
#$ companies <chr> "comp1", "comp2", "comp3", "comp4"
#$ brands <chr> "brand1", "brand2", "brand3", "brand4"
My dataset did not include the product model information, so I have created a product_model vector for each brand myself. See example below.
#Example product_model vectors
brand1_prod_mod <- c("b1prodmod1", "b1prodmod2", "b1prodmod3")
brand2_prod_mod <- c("b2prodmod1", "b2prodmod2", "b2prodmod3")
brand3_prod_mod <- c("b3prodmod1", "b3prodmod2", "b3prodmod3")
brand4_prod_mod <- c("b4prodmod1", "b4prodmod2", "b4prodmod3")
Some brands are incorrectly coded as product models so I would like to use something like the code below to recode/clean the brands variable. The code below runs but it only recodes some of the brands correctly. I know because I compare the original frequencies of brands to brand_r. I have tried to ensure that all strings match by trying various methods like str_replace_all() and tolower(), but it still isn't recoding fully. What's confusing is when I simply run setdiff() to isolate the difference between companies_brands$brand_r and each individual product_model vector, it properly accounts for all of the matching strings, which confirms that there is no format/space/case difference to fix.
companies_brands_r <- companies_brands %>% mutate(brand_r =
if_else(str_detect(brands, brand1_prod_mod), "brand1_R",
if_else(str_detect(brands, brand2_prod_mod), "brand2_R",
if_else(str_detect(brands, brand3_prod_mod), "brand3_R",
if_else(str_detect(brands, brand4_prod_mod), "brand4_R", brands)))))
If anyone has any idea what the issue is here, I would greatly appreciate any guidance!
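You're close, but you would probably want to use %in% instead of string matching and case_when instead of nested if_else calls. str_detect() is vectorised over both its string and its pattern argument, so passing a vector of product models compares the elements pairwise rather than testing each brand against every model, which would explain why only some rows get recoded.
For example: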
library(dplyr)
companies_brands |>
  mutate(brand_r = case_when(
    brands %in% c("b1prodmod1", "b1prodmod2", "b1prodmod3") ~ "brand1_R",
    brands %in% c("b2prodmod1", "b2prodmod2", "b2prodmod3") ~ "brand2_R",
    brands %in% c("b3prodmod1", "b3prodmod2", "b3prodmod3") ~ "brand3_R",
    brands %in% c("b4prodmod1", "b4prodmod2", "b4prodmod3") ~ "brand4_R",
    TRUE ~ brands  # anything that is not a known product model stays as-is
  ))
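To see why the str_detect() version from the question recodes only some rows, here is a minimal sketch. It assumes a small made-up brands vector whose values are all valid brand1 models, just in a different order than the pattern vector:

```r
library(stringr)

# Product-model vector from the question
brand1_prod_mod <- c("b1prodmod1", "b1prodmod2", "b1prodmod3")

# Hypothetical column values: every one is a brand1 model, but in another order
brands <- c("b1prodmod3", "b1prodmod1", "b1prodmod2")

# Pairwise: brands[1] vs pattern[1], brands[2] vs pattern[2], ...
pairwise <- str_detect(brands, brand1_prod_mod)
pairwise
# FALSE FALSE FALSE

# Membership: each element of brands is checked against the whole vector
membership <- brands %in% brand1_prod_mod
membership
# TRUE TRUE TRUE
```

This is also consistent with setdiff() looking fine in the question: like %in%, it works on set membership, not on positions.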
Alternatively, you could do something like this with str_replace (however, you might need to adapt the regex depending on the names of the products):
library(dplyr)
library(stringr)
companies_brands |>
mutate(brand_r = str_replace(brands, "b(\\d).*", "brand\\1_R"))
Output (the same for both methods):
companies brands brand_r
1 comp1 brand1 brand1
2 comp2 brand2 brand2
3 comp3 brand3 brand3
4 comp4 brand4 brand4
5 comp1 b1prodmod1 brand1_R
6 comp2 b2prodmod2 brand2_R
7 comp3 b3prodmod3 brand3_R
8 comp4 b4prodmod3 brand4_R
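If the real model names do not follow a pattern that one regex can capture, but you would still like to keep str_detect(), one option is to collapse each vector into a single alternation. This is only a sketch: as_pattern() is a hypothetical helper, not part of stringr, and it assumes the model names contain no regex metacharacters (otherwise they would need escaping first).

```r
library(dplyr)
library(stringr)

# Product-model vectors from the question
brand1_prod_mod <- c("b1prodmod1", "b1prodmod2", "b1prodmod3")
brand2_prod_mod <- c("b2prodmod1", "b2prodmod2", "b2prodmod3")
brand3_prod_mod <- c("b3prodmod1", "b3prodmod2", "b3prodmod3")
brand4_prod_mod <- c("b4prodmod1", "b4prodmod2", "b4prodmod3")

# Hypothetical helper: turn c("a", "b") into the anchored regex "^(a|b)$"
as_pattern <- function(models) {
  paste0("^(", paste(models, collapse = "|"), ")$")
}

companies_brands <- data.frame(
  companies = rep(c("comp1", "comp2", "comp3", "comp4"), times = 2),
  brands = c("brand1", "brand2", "brand3", "brand4",
             "b1prodmod1", "b2prodmod2", "b3prodmod3", "b4prodmod3")
)

# One single-string pattern per brand, so nothing is compared pairwise
result <- companies_brands |>
  mutate(brand_r = case_when(
    str_detect(brands, as_pattern(brand1_prod_mod)) ~ "brand1_R",
    str_detect(brands, as_pattern(brand2_prod_mod)) ~ "brand2_R",
    str_detect(brands, as_pattern(brand3_prod_mod)) ~ "brand3_R",
    str_detect(brands, as_pattern(brand4_prod_mod)) ~ "brand4_R",
    TRUE ~ brands
  ))
```

For exact matches like these, %in% is simpler. The regex route is mainly useful if you later want partial matches, in which case you would drop the ^ and $ anchors.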
New data (ideally you would include some data that shows the actual problem, so we can test it properly, e.g. by using dput):
companies <- c("comp1","comp2","comp3", "comp4", "comp1", "comp2", "comp3", "comp4")
brands <- c("brand1", "brand2", "brand3", "brand4", "b1prodmod1", "b2prodmod2", "b3prodmod3", "b4prodmod3")
companies_brands <- cbind(companies, brands)
companies_brands <- data.frame(companies_brands)
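With many brands, one case_when branch per brand gets long. A possible alternative, sketched here in base R, is to keep the models in one named list and flatten it into a lookup vector. The names prod_mods and lookup are made up for this illustration, and it assumes each product model belongs to exactly one brand.

```r
# Product models per recoded brand (same values as the vectors in the question)
prod_mods <- list(
  brand1_R = c("b1prodmod1", "b1prodmod2", "b1prodmod3"),
  brand2_R = c("b2prodmod1", "b2prodmod2", "b2prodmod3"),
  brand3_R = c("b3prodmod1", "b3prodmod2", "b3prodmod3"),
  brand4_R = c("b4prodmod1", "b4prodmod2", "b4prodmod3")
)

# Flatten into a named vector: names are models, values are recoded brands
lookup <- setNames(rep(names(prod_mods), lengths(prod_mods)), unlist(prod_mods))

companies_brands <- data.frame(
  companies = rep(c("comp1", "comp2", "comp3", "comp4"), times = 2),
  brands = c("brand1", "brand2", "brand3", "brand4",
             "b1prodmod1", "b2prodmod2", "b3prodmod3", "b4prodmod3")
)

# Look up every value; values that are not product models come back as NA
recoded <- unname(lookup[companies_brands$brands])

# Keep the original brand wherever no product model matched
companies_brands$brand_r <- ifelse(is.na(recoded), companies_brands$brands, recoded)
```

Adding a brand then only means adding an entry to prod_mods, without touching the recoding code.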