Was not sure how to title this question so if there are better suggestions please edit
Let's say we have this dataframe:
Dataset
df <- data.frame(start = c(10, 20), end = c(15,33), label = c('ex1','ex2'))
Which looks like this:
start end label
1 10 15 ex1
2 20 33 ex2
What I want to get
I want to expand from start
--> end
, like so:
pos label
1 10 ex1
2 11 ex1
3 12 ex1
4 13 ex1
5 14 ex1
6 15 ex1
7 20 ex2
8 21 ex2
9 22 ex2
10 23 ex2
11 24 ex2
12 25 ex2
13 26 ex2
14 27 ex2
15 28 ex2
16 29 ex2
17 30 ex2
18 31 ex2
19 32 ex2
20 33 ex2
What I have now
f <- function(x) {data.frame(pos = x$start:x$end, label = x$label)}
df %>% rowwise() %>% do(f(.))
While my solution works, my original dataset is much larger and doubt if this is efficient. Moreover, I want to include more columns than label
so I want to retrain all columns and just spread out the start
and end
I have a data.table
solution in mind.
I made the hypothesis that your label
var is unique by observation. Otherwise, you should use a row number to group your data.
library(data.table)
df <- data.frame(start = c(10, 20), end = c(15,33), label = c('ex1','ex2'))
setDT(df)
df[, seq(.SD[['start']], .SD[['end']]), by = label]
label V1
1: ex1 10
2: ex1 11
3: ex1 12
4: ex1 13
5: ex1 14
6: ex1 15
7: ex2 20
8: ex2 21
9: ex2 22
10: ex2 23
11: ex2 24
12: ex2 25
13: ex2 26
14: ex2 27
15: ex2 28
16: ex2 29
17: ex2 30
18: ex2 31
19: ex2 32
20: ex2 33
In terms of efficiency, it might be hard to find a solution faster than data.table
that is designed to that end.
If you can't use label
as a unique identifier, you can do
df[,'rn' := seq(.N)]
df[, seq(.SD[['start']], .SD[['end']]), by = c('rn','label')]
rn label V1
1: 1 ex1 10
2: 1 ex1 11
3: 1 ex1 12
4: 1 ex1 13
5: 1 ex1 14
6: 1 ex1 15
7: 2 ex2 20
8: 2 ex2 21
9: 2 ex2 22
10: 2 ex2 23
11: 2 ex2 24
12: 2 ex2 25
13: 2 ex2 26
14: 2 ex2 27
15: 2 ex2 28
16: 2 ex2 29
17: 2 ex2 30
18: 2 ex2 31
19: 2 ex2 32
20: 2 ex2 33
and you can drop the intermediate row number using df[,'rn' := NULL]
data.table
brings a good speedup (does not matter that much if you use one or two columns to group in this example)
Unit: microseconds
expr min lq mean median uq
df %>% rowwise() %>% do(f(.)) 1549.408 1808.669 2309.332 2292.525 2555.888
df[, seq(.SD[["start"]], .SD[["end"]]), by = "label"] 1011.608 1302.249 1555.808 1490.542 1779.543
df[, seq(.SD[["start"]], .SD[["end"]]), by = c("label", "rn")] 968.124 1095.703 1387.556 1253.023 1592.483
max neval cld
7141.964 100 b
3061.487 100 a
2953.598 100 a
If you want to go even faster, you can set a key (?setkeyv
). If your dataframe is of significant size, this might bring huge performance gains (in this small example it won't)