Tags: r, tidyverse, stringr

How to modify str_replace_all


I am trying to standardize species names in a data frame using str_replace_all(). However, rather than replacing the old name, the function appends extra characters to names that are already correct. The names are mixed: some are already correct and some need fixing, and I have 30+ species names to adjust. Any tips?

Example code:

library(tidyverse)

var1 <- c(1:5)
var2 <- c("Nemophila menziesii", "Nemophila menziesii var.menziesii", "Ceanothus cuneatus", "Ceanothus cuneatus var.cuneatus", "Diplacus auranticus")

df <- as.data.frame(var2, var1)

df %>%
  mutate(var2 = str_replace_all(var2, "Nemophila menziesii", "Nemophila menziesii var.menziesii")) %>%
  mutate(var2 = str_replace_all(var2, "Ceanothus cuneatus", "Ceanothus cuneatus var.cuneatus"))

Which outputs:

                                             var2
1               Nemophila menziesii var.menziesii
2 Nemophila menziesii var.menziesii var.menziesii
3                 Ceanothus cuneatus var.cuneatus
4    Ceanothus cuneatus var.cuneatus var.cuneatus
5                             Diplacus auranticus

Rows 2 and 4 are perfect examples of the issue I am having. They should just be "Nemophila menziesii var.menziesii" and "Ceanothus cuneatus var.cuneatus".


Solution

  • This might work: if you place $ at the end of the pattern, it is replaced only when it sits at the very end of the string:

    library(tidyverse)
    
    var1 <- c(1:5)
    var2 <- c("Nemophila menziesii", "Nemophila menziesii var.menziesii", "Ceanothus cuneatus", "Ceanothus cuneatus var.cuneatus", "Diplacus auranticus")
    
    df <- as.data.frame(var2, var1)
    
    df %>%
      mutate(var2 = str_replace_all(var2, "Nemophila menziesii$", "Nemophila menziesii var.menziesii")) %>%
      mutate(var2 = str_replace_all(var2, "Ceanothus cuneatus$", "Ceanothus cuneatus var.cuneatus"))
    
    #>                                var2
    #> 1 Nemophila menziesii var.menziesii
    #> 2 Nemophila menziesii var.menziesii
    #> 3   Ceanothus cuneatus var.cuneatus
    #> 4   Ceanothus cuneatus var.cuneatus
    #> 5               Diplacus auranticus
    

    In the original example, str_replace_all finds the pattern inside the strings in rows 2 and 4 and replaces only that matched part with the replacement string. The existing " var.menziesii" and " var.cuneatus" suffixes are left in place, so they end up duplicated. The $ anchor says "match the pattern only when nothing comes after it".
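
One caveat worth illustrating: $ only anchors the end of the string, so a name with extra text in front of it would still match. The sketch below uses a made-up vector x (the "cf." entry is a hypothetical value, not from the question's data) to compare $ alone with ^...$, which requires the pattern to span the whole string.

```r
library(stringr)

# Hypothetical test vector; the third element is invented for illustration
x <- c("Nemophila menziesii",
       "Nemophila menziesii var.menziesii",
       "cf. Nemophila menziesii")

# "$" alone anchors only the end, so the third element is changed too
end_only <- str_replace(x, "Nemophila menziesii$",
                        "Nemophila menziesii var.menziesii")

# "^...$" matches only when the pattern is the entire string
whole <- str_replace(x, "^Nemophila menziesii$",
                     "Nemophila menziesii var.menziesii")

print(end_only)
print(whole)
```

If every entry in the real data is either a bare name or a fully qualified one, the two patterns behave the same, and $ alone is enough.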

    What you were maybe thinking of is a sort of if_else or case_when replacement setup? Like the one below, which tests for whole-string matches and gives the same output:

    df %>%
      mutate(var2 = case_when(
        var2 == "Nemophila menziesii" ~ "Nemophila menziesii var.menziesii",
        var2 == "Ceanothus cuneatus" ~ "Ceanothus cuneatus var.cuneatus",
        TRUE ~ var2))
    
    #>                                var2
    #> 1 Nemophila menziesii var.menziesii
    #> 2 Nemophila menziesii var.menziesii
    #> 3   Ceanothus cuneatus var.cuneatus
    #> 4   Ceanothus cuneatus var.cuneatus
    #> 5               Diplacus auranticus
    

    Created on 2022-03-18 by the reprex package (v2.0.1)
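
With 30+ names to fix, writing one mutate or case_when line per species gets long. A possible alternative, sketched here under the assumption that every fix is a whole-string match, is to keep the corrections in a named character vector and look each value up in it. The name lookup is hypothetical, and only the two corrections from the question are filled in.

```r
library(dplyr)

# Hypothetical lookup table: names are the spellings to fix,
# values are the standardized names. Extend it with the remaining species.
lookup <- c(
  "Nemophila menziesii" = "Nemophila menziesii var.menziesii",
  "Ceanothus cuneatus"  = "Ceanothus cuneatus var.cuneatus"
)

df <- data.frame(
  var2 = c("Nemophila menziesii", "Nemophila menziesii var.menziesii",
           "Ceanothus cuneatus", "Ceanothus cuneatus var.cuneatus",
           "Diplacus auranticus"),
  stringsAsFactors = FALSE
)

# lookup[var2] returns NA for names that are not in the table;
# coalesce() then falls back to the original value for those rows.
fixed <- df %>%
  mutate(var2 = coalesce(unname(lookup[var2]), var2))

print(fixed)
```

Because the lookup is by exact name, rows that are already correct (or not listed at all) pass through unchanged, and the table can live in a separate CSV if that is easier to maintain.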