I had this dataframe of one column. I created a new column filenum
using case_when
as follow
library(tidyverse)
filename <- c('./1_test1', './1_test2',
'./2_test1', './2_test2',
'./3_test1', './3_test2')
df <- data.frame(filename)
df <- test %>% mutate(filenum = case_when(
length(grep('./1', df$filename)) > 0 ~ 1,
length(grep('./2', df$filename)) > 0 ~ 2,
length(grep('./3', df$filename)) > 0 ~ 3,
)
)
I expected the dataframe would become
filename filenum
1 ./1_test1 1
2 ./1_test2 1
3 ./2_test1 2
4 ./2_test2 2
5 ./3_test1 3
6 ./3_test2 3
However, I got this result
filename filenum
1 ./1_test1 1
2 ./1_test2 1
3 ./2_test1 1
4 ./2_test2 1
5 ./3_test1 1
6 ./3_test2 1
My questions are:
You can use grepl
instead of grep
, which output logical values. In this way, you don't need to set length(...) > 0
. Also, you don't need to use the df$
syntax to refer to columns in dplyr
, just the column name itself is enough.
Moreover, a dot .
means any character in regex in grepl
, so you better escape it, or use the fixed = TRUE
parameter. The complete code should be:
library(dplyr)
df %>%
mutate(filenum = case_when(grepl('./1', filename, fixed = T) ~ 1,
grepl('./2', filename, fixed = T) ~ 2,
grepl('./3', filename, fixed = T) ~ 3))
Or do it with sub
if you want to extract the digit after the /
symbol and before _
.
Regex explanation:
.
means any character in regular expression (regex), we need to "escape" it so that R knows we are referring to the literal dot character by adding double slashes before it \\.
.\\d
is another regex element which means digit, we put a plus sign +
after it to indicate one or more digits.(\\d+)
is wrapped in brackets, which indicates a regex capture group.\\1
to refer back to this capture group as the replacement pattern in sub
.df %>% mutate(filenum = sub("\\./(\\d+)_.*", "\\1", filename))
filename filenum
1 ./1_test1 1
2 ./1_test2 1
3 ./2_test1 2
4 ./2_test2 2
5 ./3_test1 3
6 ./3_test2 3