Search code examples
rdataframeexpand

Expand number range to the individual numbers


Was not sure how to title this question so if there are better suggestions please edit


Let's say we have this dataframe:

Dataset

df <- data.frame(start = c(10, 20), end = c(15,33), label = c('ex1','ex2'))

Which looks like this:

  start end label
1    10  15   ex1
2    20  33   ex2


What I want to get
I want to expand from start --> end, like so:

  pos label
1   10   ex1
2   11   ex1
3   12   ex1
4   13   ex1
5   14   ex1
6   15   ex1
7   20   ex2
8   21   ex2
9   22   ex2
10  23   ex2
11  24   ex2
12  25   ex2
13  26   ex2
14  27   ex2
15  28   ex2
16  29   ex2
17  30   ex2
18  31   ex2
19  32   ex2
20  33   ex2

What I have now

f <- function(x) {data.frame(pos = x$start:x$end, label = x$label)}
df %>% rowwise() %>% do(f(.))

While my solution works, my original dataset is much larger and doubt if this is efficient. Moreover, I want to include more columns than label so I want to retrain all columns and just spread out the start and end


Solution

  • I have a data.table solution in mind.

    I made the hypothesis that your label var is unique by observation. Otherwise, you should use a row number to group your data.

    library(data.table)
    df <- data.frame(start = c(10, 20), end = c(15,33), label = c('ex1','ex2'))
    setDT(df)
    
    df[, seq(.SD[['start']], .SD[['end']]), by = label]
    label V1
     1:   ex1 10
     2:   ex1 11
     3:   ex1 12
     4:   ex1 13
     5:   ex1 14
     6:   ex1 15
     7:   ex2 20
     8:   ex2 21
     9:   ex2 22
    10:   ex2 23
    11:   ex2 24
    12:   ex2 25
    13:   ex2 26
    14:   ex2 27
    15:   ex2 28
    16:   ex2 29
    17:   ex2 30
    18:   ex2 31
    19:   ex2 32
    20:   ex2 33
    

    In terms of efficiency, it might be hard to find a solution faster than data.table that is designed to that end.

    If you can't use label as a unique identifier, you can do

    df[,'rn' := seq(.N)]
    
    df[, seq(.SD[['start']], .SD[['end']]), by = c('rn','label')]
        rn label V1
     1:  1   ex1 10
     2:  1   ex1 11
     3:  1   ex1 12
     4:  1   ex1 13
     5:  1   ex1 14
     6:  1   ex1 15
     7:  2   ex2 20
     8:  2   ex2 21
     9:  2   ex2 22
    10:  2   ex2 23
    11:  2   ex2 24
    12:  2   ex2 25
    13:  2   ex2 26
    14:  2   ex2 27
    15:  2   ex2 28
    16:  2   ex2 29
    17:  2   ex2 30
    18:  2   ex2 31
    19:  2   ex2 32
    20:  2   ex2 33
    

    and you can drop the intermediate row number using df[,'rn' := NULL]

    Efficiency

    data.table brings a good speedup (does not matter that much if you use one or two columns to group in this example)

    Unit: microseconds
                                                               expr      min       lq     mean   median       uq
                                      df %>% rowwise() %>% do(f(.)) 1549.408 1808.669 2309.332 2292.525 2555.888
              df[, seq(.SD[["start"]], .SD[["end"]]), by = "label"] 1011.608 1302.249 1555.808 1490.542 1779.543
     df[, seq(.SD[["start"]], .SD[["end"]]), by = c("label", "rn")]  968.124 1095.703 1387.556 1253.023 1592.483
          max neval cld
     7141.964   100   b
     3061.487   100  a 
     2953.598   100  a 
    

    If you want to go even faster, you can set a key (?setkeyv). If your dataframe is of significant size, this might bring huge performance gains (in this small example it won't)