
Count semicolon-delimited values and replicate a value that many times in R


Given the following example data ...

id                               Proteins
522     Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7
523                                Q9UHV7
524                       Q9Y6T7-2;Q9Y6T7
525                       Q9Y6T7-2;Q9Y6T7

... I would like to create a third column that repeats each id once for every semicolon-delimited value in its row. More specifically, something like this:

id                               Proteins     newCol
522     Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7    522;522;522;522
523                                Q9UHV7    523
524                       Q9Y6T7-2;Q9Y6T7    524;524
525                       Q9Y6T7-2;Q9Y6T7    525;525

I have tried dt$newCol <- rep(dt$id, lengths(str_split(dt$Proteins, ";"))), but it doesn't work because rep() returns a vector longer than the number of rows.


Solution

  • Something like this?

    library(stringr)
    df$newCol <- str_replace_all(df$Proteins, "[^;]+", as.character(df$id))
    

    Output

    > df
       id                          Proteins          newCol
    1 522 Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7 522;522;522;522
    2 523                            Q9UHV7             523
    3 524                   Q9Y6T7-2;Q9Y6T7         524;524
    4 525                   Q9Y6T7-2;Q9Y6T7         525;525
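
To see why this works, here is a minimal base R sketch on a single value from the example data (the variable names proteins and one_row are only for illustration). The pattern [^;]+ matches each run of characters that are not a semicolon, so every protein ID is replaced while the separators stay in place:

```r
# One Proteins value taken from the example data in the question
proteins <- "Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7"

# "[^;]+" matches each maximal run of non-semicolon characters,
# so each of the four protein IDs is replaced by the id "522"
one_row <- gsub("[^;]+", "522", proteins)

print(one_row)
```

In the stringr call above, the replacement argument is a vector with one id per row, so each row is filled with its own id.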
    

    A base R alternative, suggested by @markus:

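    # Note: Map() returns a list, so `new` becomes a list column;
    # wrap the call in unlist() if a plain character vector is needed
    df1$new <- Map(gsub, pattern = "[^;]+", replacement = df1$id, x = df1$Proteins)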
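
The original rep() attempt can probably be rescued by collapsing the repeated ids row by row. This is only a sketch in base R, assuming the example data from the question; the data frame is rebuilt here so the snippet runs on its own, and the name n is just an illustrative helper:

```r
# Rebuild the example data from the question
df <- data.frame(
  id = c(522, 523, 524, 525),
  Proteins = c("Q9UHC7-4;Q9UHC7-3;Q9UHC7-2;Q9UHC7",
               "Q9UHV7",
               "Q9Y6T7-2;Q9Y6T7",
               "Q9Y6T7-2;Q9Y6T7"),
  stringsAsFactors = FALSE
)

# Number of semicolon-delimited values in each row
n <- lengths(strsplit(df$Proteins, ";", fixed = TRUE))

# Repeat each id n times and paste the copies back into one string per row
df$newCol <- mapply(function(id, k) paste(rep(id, k), collapse = ";"),
                    df$id, n)

print(df)
```

On its own, rep(df$id, n) has sum(n) = 9 elements, which is why it cannot be assigned to a column of a 4-row data frame.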