r data-cleaning

Standardize group names using a vector of possible matches


I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:

df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
for (i in TestVector) {
  df$grpl <- grepl(paste0(i), df$b)
  df[ which(df$grpl == TRUE),]$standard <- "male"
}

The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.


Solution

  • Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:

    df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
    TestVector <- "male"
    df$standard <- NA
    for (i in TestVector) {
      df[ grepl(i, df$b), "standard"] <- "male"
    }
    df
    #   a                   b standard
    # 1 1     depression_male     male
    # 2 2   depression_female     male
    # 3 3   depression_hsgrad     <NA>
    # 4 4 depression_collgrad     <NA>
    

    Then you've got the issue that the "male" pattern matches "female" as well.
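
    One way around that, sketched here under the assumption that the group name always comes right after an underscore at the end of the string, is to anchor the pattern. The two-element TestVector below is made up for illustration, and the loop assigns i rather than a hard-coded name so it extends to several groups:

    ```r
    df <- data.frame(
      a = c(1, 2, 3, 4),
      b = c("depression_male", "depression_female",
            "depression_hsgrad", "depression_collgrad")
    )
    TestVector <- c("male", "female")  # hypothetical multi-element vector
    df$standard <- NA_character_       # initialize as a character column
    for (i in TestVector) {
      # "_male$" only matches when "male" directly follows an underscore
      # and ends the string, so "depression_female" is not caught by "male"
      df[grepl(paste0("_", i, "$"), df$b), "standard"] <- i
    }
    df$standard
    ```

    If your names don't follow that layout, the anchoring would need to change accordingly.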

    Perhaps you're looking for sub instead? It works like find/replace:

    df$standard <- sub(pattern = "depression_", replacement = "", x = df$b)
    df
    #   a                   b standard
    # 1 1     depression_male     male
    # 2 2   depression_female   female
    # 3 3   depression_hsgrad   hsgrad
    # 4 4 depression_collgrad collgrad
    

    It's hard to say what will work best in your case without more example input/output pairs. If all your values have the form "depression_<group>", this will work well. If the standard name always follows the last underscore, you could instead use pattern = ".*_", which removes everything up to and including the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
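
    As a rough illustration of the ".*_" idea, here is a small sketch on made-up values (the "anxiety_severe_female" entry is hypothetical, added only to show a name with more than one underscore). Because .* is greedy, the match runs up to the last underscore:

    ```r
    # Hypothetical inputs with different prefixes and underscore counts
    b <- c("depression_male", "anxiety_severe_female", "depression_hsgrad")

    # Remove everything up to and including the last underscore
    standard <- sub(pattern = ".*_", replacement = "", x = b)
    standard
    # [1] "male"   "female" "hsgrad"
    ```

    Values with no underscore at all are returned unchanged, since sub leaves non-matching strings alone.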