Tags: r, dataframe, parallel-processing, large-data

Efficient way to subtract two very large data frames with different numbers of rows (matching XYZ)


I have pairs of very large data.frames df1 & df2 (>500,000 rows) with different numbers of rows, both containing the same 4 columns (X, Y, Z coordinates and a count attribute). Dummy example:

df1<-data.frame(x=c(3,5,2,4),y=c(8,5,7,6),z=c(13,15,12,10),
      count=c(10,20,4,12))
df2<-data.frame(x=c(4,3,6),y=c(6,9,8),z=c(10,13,15),count=c(4,7,3))

I want to subtract the count column (df1$count - df2$count) only for rows with matching XYZ (the same spatial point). I found a way to do it using merge() from base R, but it is slow and the data frames are very large.

Any tips on how I could make it faster? Should I try parallel processing? If so, how could I parallelize this example without having to split the data frames into chunks?

Thanks.

My implementation:

df3 <- merge(df1, df2, by.x = c("x", "y", "z"), by.y = c("x", "y", "z"),
             all.x = TRUE, all.y = TRUE)   # full outer join on XYZ
df3[is.na(df3$count.x), 4] <- 0            # missing in df1 -> count 0
df3[is.na(df3$count.y), 5] <- 0            # missing in df2 -> count 0
df3$countdif <- df3$count.y - df3$count.x

NEW EDIT. ANSWER: Both suggestions in akrun's answer worked fine. The first one (dplyr) turned out to be about 2x faster in the microbenchmark test, and it also works on my large data frames. Here is the benchmark of the two:

Using dplyr:

library(dplyr)

dodplyr <- function(a, b){
    # full join on XYZ, replace NA counts with 0, then take the difference
    dfN <- full_join(a, b, by = c('x', 'y', 'z')) %>%
        mutate_each(funs(replace(., which(is.na(.)), 0)),
                    starts_with('count')) %>%
        mutate(countdif = count.y - count.x)
    dfN <- select(dfN, -count.x, -count.y)
    return(dfN)
}
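(For anyone on a newer dplyr: mutate_each()/funs() have since been deprecated. A rough equivalent using across() might look like the sketch below, assuming dplyr >= 1.0; dodplyr_across is just an illustrative name.)

library(dplyr)

# Sketch: same logic as dodplyr(), but with across() in place of the
# deprecated mutate_each()/funs() (assumes dplyr >= 1.0)
dodplyr_across <- function(a, b){
    full_join(a, b, by = c('x', 'y', 'z')) %>%
        mutate(across(starts_with('count'), ~ replace(.x, is.na(.x), 0))) %>%
        mutate(countdif = count.y - count.x) %>%
        select(-count.x, -count.y)
}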

and using data.table:

library(data.table)

dodata.table <- function(a, b){
    setDT(a)   # converts to data.table by reference
    setDT(b)
    DT <- merge(a, b, by = c('x', 'y', 'z'), all = TRUE)   # full outer join on XYZ
    # replace NA counts (columns 4 and 5) with 0 by reference
    for(j in 4:5){ set(DT, i = which(is.na(DT[[j]])), j = j, value = 0) }
    DT[, countdif := count.y - count.x]
    DT[, c("count.x", "count.y") := NULL]
    return(DT)
}
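(On data.table >= 1.12.4, the NA-replacement loop could also be written with setnafill(), which fills NAs in numeric columns by reference. A sketch of that variant; dodata.table2 is just an illustrative name.)

library(data.table)

# Sketch: same full join, with the for()/set() loop replaced by setnafill()
# (requires data.table >= 1.12.4)
dodata.table2 <- function(a, b){
    setDT(a)
    setDT(b)
    DT <- merge(a, b, by = c('x', 'y', 'z'), all = TRUE)
    setnafill(DT, fill = 0, cols = c("count.x", "count.y"))
    DT[, countdif := count.y - count.x]
    DT[, c("count.x", "count.y") := NULL]
    DT[]
}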

And the microbenchmark:

library(microbenchmark)
times <- microbenchmark(dodplyr(df1, df2), dodata.table(df1, df2), times = 1e3)
> times
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
      dodplyr(df1, df2) 2.374164 2.489710 2.978814 2.590829 2.704017 18.15356  1000
 dodata.table(df1, df2) 5.094271 5.308994 6.458764 5.534259 5.675328 37.23370  1000

HOWEVER, I could not compare them with my implementation using merge() from base R on the data frames. I tried to include it, but I get errors when calling microbenchmark. Here is what I tried:

domerge <- function(a, b){
  # same approach as above: full outer join on XYZ, then take the difference
  dfm <- merge(a, b, by.x = c("x", "y", "z"), by.y = c("x", "y", "z"),
               all.x = TRUE, all.y = TRUE)
  dfm[is.na(dfm$count.x), 4] <- 0
  dfm[is.na(dfm$count.y), 5] <- 0
  dfm$countdif <- dfm$count.y - dfm$count.x
  dfm <- dfm[, c(1:3, 6)]    # keep x, y, z and countdif
  return(dfm)
}

That works when calling it directly, e.g. df3 <- domerge(df1, df2), but it gives an error when microbenchmarking:

> times <- microbenchmark(domerge(df1,df2), dodplyr(df1,df2), dodata.table(df1,df2), times=1e3)

 Error in merge.data.table(a, b, by.x = c("x", "y", "z"), by.y = c("x",  : 
  Can not match keys in x and y to automatically determine appropriate `by` parameter. Please set `by` value explicitly.
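My guess at the cause (an assumption, not verified): dodata.table() calls setDT(), which converts df1 and df2 to data.tables by reference, so once it has run inside the benchmark, the merge() call in domerge() dispatches to merge.data.table() rather than merge.data.frame() and trips over the by.x/by.y arguments. A possible workaround would be to copy inside the function instead of converting the caller's objects, e.g. (dodata.table.copy is just an illustrative name):

library(data.table)

# Sketch of a workaround (assumes the error comes from setDT() converting
# df1/df2 by reference): as.data.table() makes a copy, so df1 and df2 stay
# plain data.frames between benchmark iterations
dodata.table.copy <- function(a, b){
    a <- as.data.table(a)
    b <- as.data.table(b)
    DT <- merge(a, b, by = c('x', 'y', 'z'), all = TRUE)
    for(j in 4:5){ set(DT, i = which(is.na(DT[[j]])), j = j, value = 0) }
    DT[, countdif := count.y - count.x]
    DT[, c("count.x", "count.y") := NULL]
    DT[]
}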

Solution

  • I am guessing the full_join from dplyr would be faster than merge (not tested, though). After the full_join, the NAs in the 'count' columns are replaced by 0 using mutate_each, and then we create the 'countdif' column using mutate:

    library(dplyr)
    dfN <- full_join(df1,df2, by=c('x', 'y', 'z')) %>%
               mutate_each(funs(replace(., which(is.na(.)), 0)), 
                               starts_with('count')) %>%  
               mutate(countdif= count.y-count.x) 
    dfN
    #  x y  z count.x count.y countdif
    #1 3 8 13      10       0      -10
    #2 5 5 15      20       0      -20
    #3 2 7 12       4       0       -4
    #4 4 6 10      12       4       -8
    #5 3 9 13       0       7        7
    #6 6 8 15       0       3        3
    

    Or a possible approach using merge.data.table from data.table. We convert both 'data.frame's to 'data.table's (setDT(df1), setDT(df2)). Then we use merge.data.table to do the full join. We replace the NA values in the 'count' columns (here, the 4th and 5th columns) with 0 in a for loop with set (set is very efficient as it avoids the overhead of [.data.table). Finally, we assign (:=) a new column 'countdif' as the difference between 'count.y' and 'count.x':

    library(data.table)
    setDT(df1)
    setDT(df2)
    DT <- merge(df1, df2, by=c('x', 'y', 'z'), all=TRUE)
    
    for(j in 4:5){
             set(DT, i=which(is.na(DT[[j]])), j=j, value=0)
      }
    DT[, countdif:= count.y-count.x]
    DT
    #   x y  z count.x count.y countdif
    #1: 2 7 12       4       0       -4
    #2: 3 8 13      10       0      -10
    #3: 3 9 13       0       7        7
    #4: 4 6 10      12       4       -8
    #5: 5 5 15      20       0      -20
    #6: 6 8 15       0       3        3
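
    As a quick sanity check (a sketch, assuming the dodplyr() and dodata.table() wrappers from the question are defined), the two approaches can be compared on the dummy data:

    # Sketch: both wrappers should give the same x, y, z and countdif values
    # once their rows are put in the same order
    res_dplyr <- as.data.frame(dodplyr(df1, df2))
    res_dt    <- as.data.frame(dodata.table(df1, df2))
    all.equal(res_dplyr[order(res_dplyr$x, res_dplyr$y, res_dplyr$z), ],
              res_dt[order(res_dt$x, res_dt$y, res_dt$z), ],
              check.attributes = FALSE)   # should return TRUE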