Very simple question and I regret having to ask it on SO. I'm hoping a creative mind out there has a better solution than I have thought of.
I want to find the differences between adjacent values of a matrix where the value for "type" is the same. For the below example, I would want a vector with the values 2,6,1.
mat
value type
[1,] 5 A
[2,] 7 A
[3,] 1 B
[4,] 2 C
[5,] 8 C
[6,] 9 C
I have tried implementing it in two ways, but both of them are quite slow:
Approach 1: If type(row 1) = type(row 2)
, then find value(row 2) - value(row 1)
, then increment by one. If type(row 1) != type(row 2)
, then increment by 2.
Approach 2: Loop through each different type
and find all differences.
There are about 500,000 different "types" and 5 million different rows. Can anyone think of a more efficient way? I am programming in the language R, where there is already a function to find the differences for me (see: ?diff).
Since you say you have too many rows to do this, I'll suggest a data.table
solution:
require(data.table)
DT <- data.table(df) # where `df` is your data.frame
DT[, diff(value), by=type]$V1
# [1] 2 6 1
It takes close to 20 seconds (bottleneck should be calls to diff
) on data of your dimensions.
require(data.table)
set.seed(45)
types <- sapply(1:5e5, function(x) paste0(sample(letters, 5, TRUE), collapse=""))
DT <- data.table(value=sample(100, 5e6, TRUE), type=sample(types, 5e6, TRUE))
system.time(t1 <- DT[, diff(value), by=type]$V1)
# user system elapsed
# 18.610 0.238 19.166
To compare against the other answer with tapply
:
system.time(t2 <- tapply(DT[["value"]], DT[["type"]], diff))
# user system elapsed
# 48.471 0.664 51.673
Also, tapply
orders the results by type
where as data.table
without key
will preserve the original order.
Edit: Following @eddi's comment:
> system.time(t3 <- DT[, value[-1]-value[-.N], by=type]$V1)
# user system elapsed
# 6.221 0.195 6.641
There's a 3x improvement by removing the call to diff
. Thanks @eddi.