Standardize group names using a vector of possible matches

I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:

df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
for (i in TestVector) {
  df$grpl <- grepl(paste0(i), df$b)
  df[ which(df$grpl == TRUE),]$standard <- "male"
}

The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.

Solution

Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:

df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
  df[ grepl(i, df$b), "standard"] <- "male"
}
df
#   a                   b standard
# 1 1     depression_male     male
# 2 2   depression_female     male
# 3 3   depression_hsgrad     <NA>
# 4 4 depression_collgrad     <NA>

That leaves the issue you mentioned: grepl matches substrings, so the "male" pattern matches "female" as well (row 2 above).
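One way to avoid the male/female confusion is to anchor the pattern so that "male" only counts when it starts the string or follows an underscore. Here's a rough sketch that also handles several possible matches per standard name. The patterns list and the "men"/"women" alternatives are made up for illustration, so swap in whatever variants actually occur in your data:

```r
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("depression_male", "depression_female",
        "depression_hsgrad", "depression_collgrad"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: standard name -> patterns that should map to it.
# "(^|_)male$" matches "male" only at the start of the string or right
# after an underscore, so "female" is not picked up by the "male" entry.
patterns <- list(
  male   = c("(^|_)male$", "(^|_)men$"),
  female = c("(^|_)female$", "(^|_)women$")
)

df$standard <- NA_character_
for (std in names(patterns)) {
  # Collapse the alternatives into a single regex: "p1|p2"
  rx <- paste(patterns[[std]], collapse = "|")
  df$standard[grepl(rx, df$b)] <- std
}
df
#   a                   b standard
# 1 1     depression_male     male
# 2 2   depression_female   female
# 3 3   depression_hsgrad     <NA>
# 4 4 depression_collgrad     <NA>
```

Rows that match none of the patterns stay NA, which makes it easy to spot values you haven't mapped yet.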

Perhaps you're looking for sub instead? It works like find/replace:

df$standard <- sub(pattern = "depression_", replacement = "", x = df$b)
df
#   a                   b standard
# 1 1     depression_male     male
# 2 2   depression_female   female
# 3 3   depression_hsgrad   hsgrad
# 4 4 depression_collgrad collgrad

It's hard to say what will work best in your case without more example input/output pairs. If all of your values have the form "depression_<group>", this will work well. If the standard name always follows the last underscore, you could use pattern = ".*_" instead, which removes everything up to and including the last underscore. Hopefully these ideas give you a good start.
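To make the ".*_" idea concrete, here is a small sketch. The "anxiety_hs_grad" value is invented to show what happens when there is more than one underscore:

```r
# "anxiety_hs_grad" is a made-up value with two underscores
b <- c("depression_male", "depression_female", "anxiety_hs_grad")

# ".*" is greedy, so ".*_" consumes everything through the LAST underscore
standard <- sub(pattern = ".*_", replacement = "", x = b)
standard
# [1] "male"   "female" "grad"

# To strip only the first prefix instead, match non-underscore characters
# up to the first underscore
first_only <- sub(pattern = "^[^_]*_", replacement = "", x = b)
first_only
# [1] "male"    "female"  "hs_grad"
```

Which of the two you want depends on whether the group names themselves can contain underscores.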