arrays, r, string, trim, data-manipulation

Trim strings in a column that share the same pattern


I have a column populated with strings that all follow the same pattern, *.stage1. I want to copy every string to another column as a bullet point, then trim out ".stage1" so that the first column holds only the characters before ".stage1".

This would save a lot of time. Can you suggest a package that can help me create this script?

Thanks, Mago


Solution

  • Copying the column should not be an issue. You can make the altered version with sub.

    ## Some sample data
    df = data.frame(x = paste0("A", 1:9, ".stage1"))
    df
              x
    1 A1.stage1
    2 A2.stage1
    3 A3.stage1
    4 A4.stage1
    5 A5.stage1
    6 A6.stage1
    7 A7.stage1
    8 A8.stage1
    9 A9.stage1
    
    df$x2 = df$x
    df$x = sub("(.*)\\.stage1", "\\1", df$x)
    df
       x        x2
    1 A1 A1.stage1
    2 A2 A2.stage1
    3 A3 A3.stage1
    4 A4 A4.stage1
    5 A5 A5.stage1
    6 A6 A6.stage1
    7 A7 A7.stage1
    8 A8 A8.stage1
    9 A9 A9.stage1
    

    Some extra detail on the sub statement.
    sub replaces the first match of the first expression (the pattern) in each string with the second expression (the replacement). What are those expressions?

    First expression: "(.*)\\.stage1"
    . matches any single character.
    .* matches any number of characters (including none).
    \\. matches a literal dot; the backslash is doubled because it is written inside an R string.
    Because .* is in parentheses, whatever it matches is captured as a group that the replacement can refer to as \1.
    So "(.*)\\.stage1" matches the string ".stage1" and everything before it, capturing the characters before .stage1 in \1.
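
    To see what the pattern actually captures, the small sketch below pulls out the full match and the parenthesised group with regexec and regmatches from base R. The vector x and the string "A1xstage1" are made-up inputs for illustration only; the second part shows why the dot before stage1 is escaped.

    ```r
    # Made-up sample strings (not from the question's data)
    x <- c("A1.stage1", "B22.stage1")

    # regexec() locates the full match and each parenthesised group;
    # regmatches() extracts them as character vectors, one per input string.
    m <- regmatches(x, regexec("(.*)\\.stage1", x))
    m[[1]]   # element 1 is the whole match, element 2 is what (.*) captured

    # Keep only the captured group from every string
    captured <- vapply(m, function(parts) parts[2], character(1))
    captured

    # An unescaped dot matches any character, so this also strips "xstage1"
    loose  <- sub("(.*).stage1",   "\\1", "A1xstage1")
    # The escaped dot requires a literal ".", so this string is left unchanged
    strict <- sub("(.*)\\.stage1", "\\1", "A1xstage1")
    ```

    Here captured holds the same values that sub writes back through "\\1", which is all the replacement step relies on.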

    Second expression: "\\1"
    We want to replace the whole match with just the characters before .stage1, so the replacement string is "\\1" (the group reference \1, written with a doubled backslash inside an R string).
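
    As a possible alternative, since only the suffix has to go, the pattern can match just ".stage1" at the end of the string and replace it with an empty string. This is a sketch under the assumption that every value ends in ".stage1" as in the sample data; it uses only base R, so no extra package should be needed. The "banana" lines are an unrelated toy input that shows how sub differs from gsub.

    ```r
    # Same sample data as above, kept as character strings
    df <- data.frame(x = paste0("A", 1:9, ".stage1"), stringsAsFactors = FALSE)

    df$x2 <- df$x                         # keep a copy of the original strings
    df$x  <- sub("\\.stage1$", "", df$x)  # $ anchors the match to the end of the string

    # Check that this gives the same result as the capture-group version
    same <- identical(df$x, sub("(.*)\\.stage1", "\\1", df$x2))

    # sub() replaces only the first match in each string; gsub() replaces all of them
    first_only <- sub("a", "-", "banana")    # "b-nana"
    every_one  <- gsub("a", "-", "banana")   # "b-n-n-"
    ```

    The anchored form avoids the capture group entirely, which some readers may find easier to follow; the capture-group form is handy when the part to keep is not simply "everything before the suffix".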