r dplyr substitution startswith

Substitute values in a data frame based on a partial match


This is my data:

> df1
        col1      col2
1  0/0:6:6,0 0/0:6:6,0
2  0/0:6:6,0 0/1:6:6,0
...
6  1/1:6:6,0 0/0:6:6,0
7  0/0:8:8,0 0/0:8:8,0

I want to replace long entries such as "0/0:6:6,0" with just 0 if they start with "0/0", with 0.5 if they start with "0/1", and so on.

So far I have tried this:

1) replace-starts_with

df %>% mutate(col1 = replace(col1, starts_with("0/0"), 0)) %>% head()
    Error in mutate_impl(.data, dots) : 
      Evaluation error: Variable context not set.
    In addition: Warning message:
    In `[<-.factor`(`*tmp*`, list, value = 0) :
      invalid factor level, NA generated

2) grep (seen this as a solution here)

df[,1][grep("0/1",df[,1])]<-0.5
Warning message:
In `[<-.factor`(`*tmp*`, grep("0/1", df[, 1]), value = c(NA, 2L,  :
  invalid factor level, NA generated

Feeling lost... it's been a long day


Solution

  • We can use grepl with a pattern anchored at the start of the string (^):

    df1 %>%
       mutate(col1 = replace(col1, grepl("^0/0", col1), 0))
    #       col1      col2
    #1         0 0/0:6:6,0
    #2         0 0/1:6:6,0
    #3 1/1:6:6,0 0/0:6:6,0
    #4         0 0/0:8:8,0
    

    Or use startsWith from base R (available since R 3.3.0):

    df1 %>%
        mutate(col1 = replace(col1, startsWith(col1, "0/0"), 0))
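
The "invalid factor level, NA generated" warnings in the question suggest that the columns there are stored as factors, and startsWith only accepts character vectors. The following is a minimal sketch of converting to character before replacing, not a definitive fix. It assumes a four-row copy of the data built with factor columns; the names df_fct and col1_chr are hypothetical and used only for illustration.

```r
# Hypothetical reconstruction of the data with factor columns,
# which is what the warnings in the question point to
df_fct <- data.frame(
  col1 = c("0/0:6:6,0", "0/0:6:6,0", "1/1:6:6,0", "0/0:8:8,0"),
  col2 = c("0/0:6:6,0", "0/1:6:6,0", "0/0:6:6,0", "0/0:8:8,0"),
  stringsAsFactors = TRUE
)

# startsWith() requires character input, so convert the factor first
col1_chr <- as.character(df_fct$col1)

# Replace matching entries; the result stays a character vector
df_fct$col1 <- replace(col1_chr, startsWith(col1_chr, "0/0"), "0")
df_fct
```

Reading the data with stringsAsFactors = FALSE in the first place would avoid the conversion step.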
    

    The issue with dplyr::starts_with is that it is a helper function for selecting variables based on their names:

    df1 %>%
        select(starts_with('col1'))
    #       col1
    #1 0/0:6:6,0
    #2 0/0:6:6,0
    #6 1/1:6:6,0
    #7 0/0:8:8,0
    

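    It does not look at the values of the variables, which is why calling it inside mutate fails with "Variable context not set". startsWith, by contrast, returns a logical vector, as grepl does: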

    startsWith(df1$col1, "0/0")
    #[1]  TRUE  TRUE FALSE  TRUE
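
The question asks for a full mapping (0 for "0/0", 0.5 for "0/1", "etc.") across the columns, while the snippets above replace one prefix in one column. One possible way to extend the same startsWith idea is sketched below in base R. The helper name recode_prefix is hypothetical, and the value 1 for "1/1" is an assumption, since the question does not spell out what "etc." covers.

```r
# Small copy of the data shown in the answer (character columns)
df1 <- data.frame(
  col1 = c("0/0:6:6,0", "0/0:6:6,0", "1/1:6:6,0", "0/0:8:8,0"),
  col2 = c("0/0:6:6,0", "0/1:6:6,0", "0/0:6:6,0", "0/0:8:8,0"),
  stringsAsFactors = FALSE
)

# recode_prefix is a hypothetical helper, not part of any package
recode_prefix <- function(x) {
  x <- as.character(x)               # also copes with factor input
  out <- rep(NA_real_, length(x))    # entries with no matching prefix stay NA
  out[startsWith(x, "0/0")] <- 0
  out[startsWith(x, "0/1")] <- 0.5
  out[startsWith(x, "1/1")] <- 1     # assumed value; the question only says "etc."
  out
}

# Apply the helper to every column while keeping the data frame structure
df1[] <- lapply(df1, recode_prefix)
df1
```

Because the result is numeric rather than character, the columns can be used directly in calculations. With dplyr 1.0.0 or later, the same helper could instead be applied inside a pipeline with mutate(across(everything(), recode_prefix)).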