Search code examples
rloopsarray-difference

most efficient way to find differences?


Very simple question and I regret having to ask it on SO. I'm hoping a creative mind out there has a better solution than I have thought of.

I want to find the differences between adjacent values of a matrix where the value for "type" is the same. For the below example, I would want a vector with the values 2,6,1.

mat
     value type
[1,]  5   A
[2,]  7   A
[3,]  1   B
[4,]  2   C
[5,]  8   C
[6,]  9   C

I have tried implementing it in two ways, but both of them are quite slow:

  • Approach 1: If type(row 1) = type(row 2), then find value(row 2) - value(row 1), then increment by one. If type(row 1) != type(row 2), then increment by 2.

  • Approach 2: Loop through each different type and find all differences.

There are about 500,000 different "types" and 5 million different rows. Can anyone think of a more efficient way? I am programming in the language R, where there is already a function to find the differences for me (see: ?diff).


Solution

  • Since you say you have too many rows to do this, I'll suggest a data.table solution:

    require(data.table)
    DT <- data.table(df) # where `df` is your data.frame
    
    DT[, diff(value), by=type]$V1
    # [1] 2 6 1
    

    Simulating this code on a data of your dimensions:

    It takes close to 20 seconds (bottleneck should be calls to diff) on data of your dimensions.

    require(data.table)
    set.seed(45)
    types <- sapply(1:5e5, function(x) paste0(sample(letters, 5, TRUE), collapse=""))
    
    DT <- data.table(value=sample(100, 5e6, TRUE), type=sample(types, 5e6, TRUE))
    system.time(t1 <- DT[, diff(value), by=type]$V1)
    #   user  system elapsed 
    # 18.610   0.238  19.166 
    

    To compare against the other answer with tapply:

    system.time(t2 <- tapply(DT[["value"]], DT[["type"]], diff))
    #   user  system elapsed 
    # 48.471   0.664  51.673 
    

    Also, tapply orders the results by type where as data.table without key will preserve the original order.


    Edit: Following @eddi's comment:

    > system.time(t3 <- DT[, value[-1]-value[-.N], by=type]$V1)
    #  user  system elapsed 
    # 6.221   0.195   6.641 
    

    There's a 3x improvement by removing the call to diff. Thanks @eddi.