I have a data frame, for example:
name student_id age gender
Sam 123_abc_ABC 20 F
John 234_bcd_BCD 18 M
Mark 345_cde_CDE 20 M
Ram xyz_111_XYZ 19 M
Hari uvw_444_UVW 23 M
Now I need a new column, student_id_by_govt, in the df. Its value is contained in student_id, but its position differs between names. For Sam, John, and Mark, student_id_by_govt is the first segment of student_id (i.e., 123, 234, 345), but for Ram and Hari it is the second segment of student_id (i.e., 111, 444).
I used strsplit and lapply to get a specific segment from student_id, but I was not able to apply them only to specific rows to get the desired output. Please let me know how to get the output below:
name student_id age gender student_id_by_govt
Sam 123_abc_ABC 20 F 123
John 234_bcd_BCD 18 M 234
Mark 345_cde_CDE 20 M 345
Ram xyz_111_XYZ 19 M 111
Hari uvw_444_UVW 23 M 444
You only need str_extract:
library(tidyverse)
df %>%
  mutate(student_id_by_govt = str_extract(student_id, "\\d+"))
# A tibble: 5 × 3
Name student_id student_id_by_govt
<chr> <chr> <chr>
1 Sam 123_abc_ABC 123
2 John 234_bcd_BCD 234
3 Mark 345_cde_CDE 345
4 Ram xyz_111_XYZ 111
5 Hari uvw_444_UVW 444
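As a small illustration of why one pattern covers both layouts: str_extract returns only the first match, so "\\d+" picks up the first run of digits wherever it sits in the string. The vector ids below is made up for the demo, and the approach assumes the other segments never contain digits.

```r
library(stringr)

# Hypothetical sample IDs: digits first in one, digits second in the other
ids <- c("123_abc_ABC", "xyz_111_XYZ")

# "\\d+" matches one or more digits; str_extract keeps the first match only
govt_id <- str_extract(ids, "\\d+")
govt_id
# [1] "123" "111"
```

Note that the result is a character vector; wrap it in as.integer() if a numeric column is wanted.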
EDIT:
If the student_id_by_govt is determined by gender, as OP notes in a comment: "for all the 'M' gender, the student_id_by_govt is first segment in the student_id (ie., 234, 345, xyz, uvw) but for 'F' gender I want the second segment of student_id (i.e., abc)", then this works:
df %>%
  mutate(student_id_by_govt = ifelse(gender == "M",
                                     str_extract(student_id, "^[^_]+"),
                                     str_extract(student_id, "(?<=_)[^_]+(?=_)")))
name student_id age gender student_id_by_govt
1 Sam 123_abc_ABC 20 F abc
2 John 234_bcd_BCD 18 M 234
3 Mark 345_cde_CDE 20 F cde
4 Ram xyz_111_XYZ 19 M xyz
5 Hari uvw_444_UVW 23 M uvw
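Here, we essentially rely on the negated character class [^_]+, which matches any character except the underscore one or more times, together with a positive lookbehind (?<=_) (meaning "match only if there is an underscore to the left") and a positive lookahead (?=_) (meaning "match only if there is an underscore to the right"). The first pattern, ^[^_]+, is anchored at the start of the string and therefore returns the first segment; the second returns the segment enclosed by underscores, i.e. the second one.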
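To see the two patterns in isolation, here is a minimal sketch on a single ID taken from the data; the variable names id, first_seg and second_seg are only illustrative.

```r
library(stringr)

id <- "123_abc_ABC"

# ^[^_]+ : from the start of the string, everything up to the first underscore
first_seg <- str_extract(id, "^[^_]+")

# (?<=_)[^_]+(?=_) : a run of non-underscores with an underscore on both sides
second_seg <- str_extract(id, "(?<=_)[^_]+(?=_)")

first_seg   # "123"
second_seg  # "abc"
```

The lookarounds only check for the underscores without consuming them, which is why they do not appear in the extracted value.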
Data for solution in EDIT:
df <- read.table(text="name student_id age gender
Sam 123_abc_ABC 20 F
John 234_bcd_BCD 18 M
Mark 345_cde_CDE 20 F
Ram xyz_111_XYZ 19 M
Hari uvw_444_UVW 23 M", header = TRUE)
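Since the question mentions strsplit and lapply, a base R alternative without regex lookarounds might look like the sketch below. It is not part of the solution above; the helper names parts and seg are made up here, and the data frame is rebuilt with only the two columns needed so the snippet runs on its own.

```r
# Same student_id and gender values as in the EDIT data
df <- data.frame(
  student_id = c("123_abc_ABC", "234_bcd_BCD", "345_cde_CDE",
                 "xyz_111_XYZ", "uvw_444_UVW"),
  gender = c("F", "M", "F", "M", "M")
)

# Split every ID on the literal underscore: a list of character vectors
parts <- strsplit(df$student_id, "_", fixed = TRUE)

# Segment index per row: first segment for "M", second for "F"
seg <- ifelse(df$gender == "M", 1, 2)

# Pick the chosen segment from each split ID, row by row
df$student_id_by_govt <- mapply(function(p, i) p[i], parts, seg)

df$student_id_by_govt
# [1] "abc" "234" "cde" "xyz" "uvw"
```

Because the segment index is just a vector, any other row-wise rule (for example one based on name) could be substituted for the gender condition.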