Renaming coat colors in R goes wrong with str_detect


I have a dataset with horses and want to group them based on coat colors. My dataset uses more than 140 colors; I would like to reduce these to only a few coat colors and assign the rest to Other. For some horses the coat color has not been registered, and those should become Unknown. Below is what the new colors should be. (To illustrate the problem I show an old coat color and a new one, but I simply want to change the coat colors in place, not create a new column.)

Horse ID   Coatcolor (old)   Coatcolor
1          black             Black
2          bayspotted        Spotted
3          chestnut          Chestnut
4          grey              Grey
5          cream dun         Other
6                            Unknown
7          blue roan         Other
8          chestnutgrey      Grey
9          blackspotted      Spotted
10                           Unknown

Instead, I get the data below (second table), where Unknown and Other are switched.

Horse ID Coatcolor
1 Black
2 Spotted
3 Chestnut
4 Grey
5 Unknown
6 Other
7 Unknown
8 Grey
9 Spotted
10 Other

I used the following code:

mydata <- data %>%
  mutate(Coatcolor = case_when(
     str_detect(Coatcolor, "spotted") ~ "Spotted",
     str_detect(Coatcolor, "grey") ~ "Grey",
     str_detect(Coatcolor, "chestnut") ~ "Chestnut",
     str_detect(Coatcolor, "black") ~ "Black",
     str_detect(Coatcolor, "") ~ "Unknown",
     TRUE ~ Coatcolor
  ))
mydata$Coatcolor[!mydata$Coatcolor %in% c("Spotted", "Grey", "Chestnut", "Black", "Unknown")] <- "Other"

So what am I doing wrong/missing? Thanks in advance.


Solution

  • You can use the recode function of the dplyr package. Assuming the missing spots are NAs, you can first set all NAs to "Unknown" with replace_na from the tidyr package. The exact steps depend on how the missing values are stored in your data (as NA or as an empty string).

    library(dplyr)
    library(tidyr)

    # Toy data: letters stand in for the original coat colors
    mydata <- tibble(
      id = 1:10,
      coatcol = letters[1:10]
    )

    mydata$coatcol[5] <- NA  # a missing value stored as NA
    mydata$coatcol[4] <- ""  # a missing value stored as an empty string

    mydata <- mydata %>%
      mutate(coatcol = na_if(coatcol, "")) %>% # convert empty string to NA
      mutate(Coatcolor_old = replace_na(coatcol, "Unknown")) %>% # set all NA to Unknown
      mutate(Coatcolor_new = recode(
        Coatcolor_old,
        'spotted' = 'Spotted',
        'bayspotted' = 'Spotted',
        'old_name' = 'new_name',
        'a' = 'A' # etc.
      ))
    mydata
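
    As an alternative that stays closer to the code in the question, here is a minimal sketch using case_when. The likely culprit is the str_detect(Coatcolor, "") line: as far as I know, in older stringr versions an empty pattern matches every non-empty string (and newer versions reject it with an error), so every leftover color becomes "Unknown", while the truly empty cells fall through to TRUE ~ Coatcolor and are later overwritten with "Other". The sketch tests for missing values explicitly and first, and lets the final TRUE branch assign "Other". It assumes unregistered colors are stored as NA or as an empty string; the horses tibble is a hypothetical stand-in for the real data.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data mirroring the table in the question;
# "" and NA both stand for "coat color not registered" (an assumption).
horses <- tibble(
  id = 1:10,
  Coatcolor = c("black", "bayspotted", "chestnut", "grey", "cream dun",
                "", "blue roan", "chestnutgrey", "blackspotted", NA)
)

horses <- horses %>%
  mutate(Coatcolor = case_when(
    # Handle missing values first, with an explicit test instead of an empty pattern
    is.na(Coatcolor) | Coatcolor == "" ~ "Unknown",
    # Order matters: "spotted" and "grey" are checked before "chestnut" and "black"
    str_detect(Coatcolor, "spotted")   ~ "Spotted",
    str_detect(Coatcolor, "grey")      ~ "Grey",
    str_detect(Coatcolor, "chestnut")  ~ "Chestnut",
    str_detect(Coatcolor, "black")     ~ "Black",
    # Every remaining color is grouped as "Other"
    TRUE                               ~ "Other"
  ))

horses
```

    Because the last branch already returns "Other", the separate %in% assignment after the pipe is no longer needed. If the real data stores missing values some other way (for example as whitespace), the first condition would need to be adjusted.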