Search code examples
rtidyr

Is there a R function to pair elements from two columns of a data.frame to make a new data.frame


I have several thousand rows like this stored in a object in R. I want to make a new df2 by pairing elements from two columns ($exonStarts and $exonEnds) from the df below

df <- structure(list(bin = c(0L, 0L, 0L, 0L, 0L, 0L), name = c("XM_011541469.2", 
"XM_017001276.2", "XM_011541467.2", "NM_001276352.2", "NM_001276351.2", 
"XM_011541465.3"), chr = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1"), strand = c("-", "-", "-", "-", "-", "-"), txStart = c(67092164L, 
67092164L, 67092164L, 67092164L, 67092164L, 67092164L), txEnd = c(67109072L, 
67131227L, 67131227L, 67134970L, 67134970L, 67134970L), cdsStart = c(67093004L, 
67093004L, 67093004L, 67093579L, 67093004L, 67093004L), cdsEnd = c(67103382L, 
67127240L, 67127240L, 67127240L, 67127240L, 67127240L), exonCount = c(5L, 
9L, 9L, 9L, 8L, 9L), exonStarts = c("67092164,67095234,67096251,67103237,67109028,", 
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141,", 
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141,", 
"67092164,67096251,67103237,67111576,67115351,67125751,67127165,67131141,67134929,", 
"67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,", 
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67134929,"
), exonEnds = c("67093604,67095421,67096321,67103382,67109072,", 
"67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67131227,", 
"67093604,67095421,67096321,67103343,67111644,67115464,67125909,67127257,67131227,", 
"67093604,67096321,67103382,67111644,67115464,67125909,67127257,67131227,67134970,", 
"67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,", 
"67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67134970,"
), score = c(0L, 0L, 0L, 0L, 0L, 0L), gene = c("C1orf141", "C1orf141", 
"C1orf141", "C1orf141", "C1orf141", "C1orf141"), cdsStartStat = c("cmpl", 
"cmpl", "cmpl", "cmpl", "cmpl", "cmpl"), cdsEndStat = c("cmpl", 
"cmpl", "cmpl", "cmpl", "cmpl", "cmpl"), exonFrames = c("0,2,1,0,-1,", 
"0,2,1,0,1,2,0,0,-1,", "0,2,1,0,1,2,0,0,-1,", "2,1,0,1,2,0,0,-1,-1,", 
"0,2,1,2,0,0,-1,-1,", "0,2,1,0,1,2,0,0,-1,")), row.names = c(NA, 
6L), class = "data.frame")
 

to produce an output like below. The first row of the input df should produce 5 rows of df2 in the format below by paring each element of $exonStart with each element of $exonEnd (the element counts will identical and is present in the column $exonCount).

$chr    $exonStart  $exonEnd
chr1    67092164    67093604
chr1    67095234    67095421
chr1    67096251    67096321
chr1    67103237    67103382
chr1    67109028    67109072

This needs to iterate over all rows of input df and all results combined into df2. So, in total there would be sum(object$exonCount) rows in df2.

I know that there would be some sort of strsplit function in here to split $exonStarts and $exonEnds and then combine them. Maybe an apply function to do this over all rows? I tried something like this first

map2(unlist(strsplit(df$exonStarts[1], ",")), unlist(strsplit(dft$exonEnds[1], ",")), c)

to produced a list of paired elements

[[1]]
[1] "67092164" "67093604"

[[2]]
[1] "67095234" "67095421"

[[3]]
[1] "67096251" "67096321"

[[4]]
[1] "67103237" "67103382"

[[5]]
[1] "67109028" "67109072"

But I do not know where to go from here. Any help would be appreciated.


Solution

  • You could use separate_longer_delim from tidyr

    library(tidyr)
    library(dplyr)
    
    select(df, chr, exonStarts, exonEnds) %>%
      separate_longer_delim(c(exonStarts, exonEnds), delim=",")
    

    Which gives (showing first 10 rows only):

        chr exonStarts exonEnds
    1  chr1   67092164 67093604
    2  chr1   67095234 67095421
    3  chr1   67096251 67096321
    4  chr1   67103237 67103382
    5  chr1   67109028 67109072
    6  chr1                    
    7  chr1   67092164 67093604
    8  chr1   67095234 67095421
    9  chr1   67096251 67096321
    10 chr1   67103237 67103382
    

    The blank rows are due to the trailing commas in the original columns, which can easily be removed using filter. If you do that, you will see that there are 49 rows remaining, the same as sum(df$exonCount).


    As @TarJae mentions, the number of items in each column must be equal, or an error will result. Presumably you are confident that this situation isn't possible.

    df$exonEnds[1] <- "67093604,67095421,67096321,67103382,"
    
    select(df, chr, exonStarts, exonEnds) %>%
      separate_longer_delim(c(exonStarts, exonEnds), delim=",")
    
    #Error in `separate_longer_delim()`:
    #! In row 1, can't recycle input of size 6 to size 5.
    #Run `rlang::last_trace()` to see where the error occurred.
    

    You'll also get an error using your strsplit approach.