Search code examples
rdataframedplyrcase

Create a column based on the substring of another column using case_when


I had this dataframe of one column. I created a new column filenum using case_when as follow

library(tidyverse)
filename <- c('./1_test1', './1_test2', 
              './2_test1', './2_test2', 
              './3_test1', './3_test2')
df <- data.frame(filename)
df <- test %>% mutate(filenum = case_when(
  length(grep('./1', df$filename)) > 0 ~ 1,
  length(grep('./2', df$filename)) > 0 ~ 2,
  length(grep('./3', df$filename)) > 0 ~ 3,
  )
)

I expected the dataframe would become

   filename filenum
1 ./1_test1       1
2 ./1_test2       1
3 ./2_test1       2
4 ./2_test2       2
5 ./3_test1       3
6 ./3_test2       3

However, I got this result

   filename filenum
1 ./1_test1       1
2 ./1_test2       1
3 ./2_test1       1
4 ./2_test2       1
5 ./3_test1       1
6 ./3_test2       1

My questions are:

  1. What did I do wrong?
  2. Is there a better way to create a new column based on the substring of an existing column?

Solution

  • You can use grepl instead of grep, which output logical values. In this way, you don't need to set length(...) > 0. Also, you don't need to use the df$ syntax to refer to columns in dplyr, just the column name itself is enough.

    Moreover, a dot . means any character in regex in grepl, so you better escape it, or use the fixed = TRUE parameter. The complete code should be:

    library(dplyr)
    
    df %>% 
      mutate(filenum = case_when(grepl('./1', filename, fixed = T) ~ 1, 
                                 grepl('./2', filename, fixed = T) ~ 2, 
                                 grepl('./3', filename, fixed = T) ~ 3))
    

    Or do it with sub if you want to extract the digit after the / symbol and before _.

    Regex explanation:

    1. Since . means any character in regular expression (regex), we need to "escape" it so that R knows we are referring to the literal dot character by adding double slashes before it \\..
    2. \\d is another regex element which means digit, we put a plus sign + after it to indicate one or more digits.
    3. You can see that (\\d+) is wrapped in brackets, which indicates a regex capture group.
    4. Use \\1 to refer back to this capture group as the replacement pattern in sub.
    df %>% mutate(filenum = sub("\\./(\\d+)_.*", "\\1", filename))
    

    Output

       filename filenum
    1 ./1_test1       1
    2 ./1_test2       1
    3 ./2_test1       2
    4 ./2_test2       2
    5 ./3_test1       3
    6 ./3_test2       3