r lapply strsplit

Is there an R function to strsplit specific rows of one column based on another column?


I have a data frame, for example:

name  student_id   age  gender
Sam   123_abc_ABC  20   F
John  234_bcd_BCD  18   M
Mark  345_cde_CDE  20   M
Ram   xyz_111_XYZ  19   M
Hari  uvw_444_UVW  23   M

Now I need a new column, student_id_by_govt, in df. Its value is contained in student_id, but which segment it is depends on the name. For Sam, John and Mark, student_id_by_govt is the first segment of student_id (i.e., 123, 234, 345), but for Ram and Hari it is the second segment (i.e., 111, 444).

I used strsplit and lapply to get a specific segment from student_id, but I could not apply them to only specific rows to get the desired output. Please let me know how to get the output below:

name  student_id   age  gender student_id_by_govt
Sam   123_abc_ABC  20   F      123
John  234_bcd_BCD  18   M      234
Mark  345_cde_CDE  20   M      345
Ram   xyz_111_XYZ  19   M      111
Hari  uvw_444_UVW  23   M      444

Solution

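  • You only need str_extract. The pattern \\d+ returns the first run of digits in each student_id, which is the first segment for Sam, John and Mark and the second segment for Ram and Hari: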

    library(tidyverse)
    df %>%
      mutate(student_id_by_govt = str_extract(student_id, "\\d+"))
    # A tibble: 5 × 3
      Name  student_id  student_id_by_govt
      <chr> <chr>       <chr>             
    1 Sam   123_abc_ABC 123               
    2 John  234_bcd_BCD 234               
    3 Mark  345_cde_CDE 345               
    4 Ram   xyz_111_XYZ 111               
    5 Hari  uvw_444_UVW 444 
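
If loading the tidyverse is not an option, the same idea can be sketched in base R. This is only an illustration: the data frame is rebuilt by hand from the question (name and student_id only), and the NA fallback for ids that contain no digits is an assumption, not something the question asks for.

```r
# Sample data copied from the question (only the columns needed here).
df <- data.frame(
  name = c("Sam", "John", "Mark", "Ram", "Hari"),
  student_id = c("123_abc_ABC", "234_bcd_BCD", "345_cde_CDE",
                 "xyz_111_XYZ", "uvw_444_UVW"),
  stringsAsFactors = FALSE
)

# regexpr() locates the first run of digits in each string (-1 if none);
# regmatches() returns the matched text for the strings that did match.
m <- regexpr("[0-9]+", df$student_id)

# Assumption: ids without any digits should end up as NA.
df$student_id_by_govt <- NA_character_
df$student_id_by_govt[m > 0] <- regmatches(df$student_id, m)

print(df)
```

Like str_extract, regexpr only reports the first match per string, so a second run of digits later in an id would be ignored.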
    

    EDIT:

    If student_id_by_govt is determined by gender, as the OP notes in a comment: "for all the 'M' gender, the student_id_by_govt is first segment in the student_id (ie., 234, 345, xyz, uvw) but for 'F' gender I want the second segment of student_id (i.e., abc)", then this works:

    df %>%
      mutate(student_id_by_govt = ifelse(gender == "M", str_extract(student_id, "^[^_]+"),
                                         str_extract(student_id, "(?<=_)[^_]+(?=_)")))
      name  student_id age gender student_id_by_govt
    1  Sam 123_abc_ABC  20      F                abc
    2 John 234_bcd_BCD  18      M                234
    3 Mark 345_cde_CDE  20      F                cde
    4  Ram xyz_111_XYZ  19      M                xyz
    5 Hari uvw_444_UVW  23      M                uvw
    

    Here, we essentially rely on the negated character class [^_]+, which matches any character except the underscore one or more times, together with the positive lookbehind (?<=_) (meaning "match only if there is an underscore to the left") and the positive lookahead (?=_) (meaning "match only if there is an underscore to the right"). The ^ anchor in ^[^_]+ ties the first pattern to the start of the string, so it returns the first segment.
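
To see what each pattern picks out, it may help to run both on a single id. The sketch below uses base R rather than stringr; the variable names are made up for illustration. In base R the lookarounds need perl = TRUE, whereas str_extract accepts them directly.

```r
id <- "123_abc_ABC"

# ^[^_]+ : everything from the start of the string up to the first underscore.
first_segment <- regmatches(id, regexpr("^[^_]+", id))

# (?<=_)[^_]+(?=_) : a run of non-underscore characters that has an
# underscore on both sides. Lookarounds require perl = TRUE in base R.
middle_segment <- regmatches(id, regexpr("(?<=_)[^_]+(?=_)", id, perl = TRUE))

print(first_segment)   # [1] "123"
print(middle_segment)  # [1] "abc"
```

The last segment (ABC) is never returned by the second pattern because nothing after it satisfies the lookahead.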

    Data for solution in EDIT:

    df <- read.table(text="name  student_id   age  gender
    Sam   123_abc_ABC  20   F
    John  234_bcd_BCD  18   M
    Mark  345_cde_CDE  20   F
    Ram   xyz_111_XYZ  19   M
    Hari  uvw_444_UVW  23   M", header = TRUE)
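
Since the question asks about strsplit specifically, here is a hedged base R sketch of the same gender-based rule without regular expressions. It assumes every student_id has exactly three underscore-separated segments; parts and segment are illustrative names, not part of the original answer.

```r
df <- read.table(text = "name  student_id   age  gender
Sam   123_abc_ABC  20   F
John  234_bcd_BCD  18   M
Mark  345_cde_CDE  20   F
Ram   xyz_111_XYZ  19   M
Hari  uvw_444_UVW  23   M", header = TRUE, stringsAsFactors = FALSE)

# Split every id on "_" and stack the pieces into a 5 x 3 character matrix.
# Assumption: all ids have the same number of segments.
parts <- do.call(rbind, strsplit(df$student_id, "_", fixed = TRUE))

# Hypothetical selector: segment 1 for "M" rows, segment 2 otherwise.
segment <- ifelse(df$gender == "M", 1L, 2L)

# Indexing a matrix with (row, column) pairs picks one segment per row.
df$student_id_by_govt <- parts[cbind(seq_len(nrow(df)), segment)]

print(df)
```

For the rule in the original question, the selector could instead be built from the names, for example ifelse(df$name %in% c("Ram", "Hari"), 2L, 1L).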