Search code examples
rregexstringstrsplit

Split columns to rows based off a string


I have three columns that have multiple values but contain different strings thereafter. These values occur after an ";" but the strings are different. I want to be able to take the following column and turn it into two. There can be more than two values as well, I just used two for this example.

I have this in a row;

Cola <- c("Qxxxx7; Pxxxx8")
Colb <- c("Qxxxx7 [1-24]; Pxxxx8 [1-24]")
Colc <- c("Qxxxx7 1xAcetyl [N-Term]; 1xPhospho [S4(100)]; Pxxxx8 1xAcetyl [N-Term]; 1xPhospho [S10(100)]")

I need them to separated to 2 rows.

  • Row 1:
Cola <- c("Qxxxx7")
Colb <- c("Qxxxx7 [1-24]")
Colc <- c("Qxxxx7 1xAcetyl [N-Term]; 1xPhospho [S4(100)]")
  • Row 2
Cola <- c("Pxxxx8")
Colb <- c("Pxxxx8 [1-24]")
Colc <- c("Pxxxx8 1xAcetyl [N-Term]; 1xPhospho [S10(100)]")

Solution

  • do.call(cbind, sapply(df1, function(x) strsplit(x, split = "; (?=Pxx)", perl = T)))
    #>      Cola        Colb               Colc                                               
    #> [1,] "Qxxxx7" "Qxxxx7 [1-24]" "Qxxxx7 1xAcetyl [N-Term]; 1xPhospho [S4(100)]" 
    #> [2,] "Pxxxx8" "Pxxxx8 [1-24]" "Pxxxx8 1xAcetyl [N-Term]; 1xPhospho [S10(100)]"
    

    If you have other columns, tidyr::separate_rows makes it much easier:

    library(tidyr)
    
    df1 %>% separate_rows(Cola, Colb, Colc, sep = "; (?=Pxx)")
    

    Created on 2023-12-06 with reprex v2.0.2

    Data:

    Cola  <- c("Qxxxx7; Pxxxx8")
    Colb  <- c("Qxxxx7 [1-24]; Pxxxx8 [1-24]")
    Colc  <- c("Qxxxx7 1xAcetyl [N-Term]; 1xPhospho [S4(100)]; Pxxxx8 1xAcetyl [N-Term]; 1xPhospho [S10(100)]")
    
    df1 <- data.frame(Cola , Colb, Colc)