I have pairs of very large data.frames, df1 and df2 (>500,000 rows), with different numbers of rows, both containing the same 4 columns (X, Y, Z coordinates and a count attribute). Dummy example:
df1<-data.frame(x=c(3,5,2,4),y=c(8,5,7,6),z=c(13,15,12,10),
count=c(10,20,4,12))
df2<-data.frame(x=c(4,3,6),y=c(6,9,8),z=c(10,13,15),count=c(4,7,3))
I want to subtract the count columns (df1$count - df2$count) only for rows matching on XYZ (the same spatial point). I found a way to do it using merge() from base R, but it is slow and the data.frames are very large.
Any tips on how I could make it faster? Should I try parallel processing, and if so, is there a way to parallelise an example like this without having to split the data.frames into chunks?
Thanks.
My implementation:
df3<-merge(df1,df2,by.x=c("x","y", "z"),by.y=c("x","y", "z"),all.x=T,all.y=TRUE)
df3[is.na(df3$count.x),4]<-0
df3[is.na(df3$count.y),5]<-0
df3$countdif<-df3$count.y-df3$count.x
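For reference, on the dummy data above this should give something like the following (the row order comes from merge()'s default sorting on the key columns):
df3
#   x y  z count.x count.y countdif
# 1 2 7 12       4       0       -4
# 2 3 8 13      10       0      -10
# 3 3 9 13       0       7        7
# 4 4 6 10      12       4       -8
# 5 5 5 15      20       0      -20
# 6 6 8 15       0       3        3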
NEW EDIT. ANSWER: The two suggestions in akrun's answer worked fine. The dplyr one turned out about 2x faster in the microbenchmark test, and it also works on my large data frames. Here is the benchmark of both:
Using dplyr:
library(dplyr)
dodplyr <- function(a, b){
  dfN <- full_join(a, b, by=c('x', 'y', 'z')) %>%
    mutate_each(funs(replace(., which(is.na(.)), 0)),
                starts_with('count')) %>%
    mutate(countdif = count.y - count.x)
  dfN <- select(dfN, -count.x, -count.y)
  return(dfN)
}
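In case a newer dplyr (>= 1.0) is used, where mutate_each() and funs() are deprecated, an equivalent version could look like this (a sketch, not benchmarked; dodplyr2 is just a placeholder name):
library(dplyr)
# sketch assuming dplyr >= 1.0; dodplyr2 is a placeholder name
dodplyr2 <- function(a, b){
  full_join(a, b, by = c("x", "y", "z")) %>%
    mutate(across(starts_with("count"), ~ replace(.x, is.na(.x), 0))) %>%  # NAs -> 0
    mutate(countdif = count.y - count.x) %>%
    select(-count.x, -count.y)
}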
And using data.table:
library(data.table)
dodata.table <- function(a, b){
  setDT(a)
  setDT(b)
  DT <- merge(a, b, by=c('x', 'y', 'z'), all=TRUE)
  for(j in 4:5){ set(DT, i=which(is.na(DT[[j]])), j=j, value=0) }
  DT[, countdif := count.y - count.x]
  DT[, c("count.x","count.y") := NULL]
  return(DT)
}
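If a recent data.table (>= 1.12.4) is available, the for/set loop could also be written with setnafill(), which fills NAs by reference; a sketch, not benchmarked, with the placeholder name dodata.tableB:
library(data.table)
# sketch assuming data.table >= 1.12.4; dodata.tableB is a placeholder name
dodata.tableB <- function(a, b){
  setDT(a)
  setDT(b)
  DT <- merge(a, b, by = c("x", "y", "z"), all = TRUE)
  setnafill(DT, fill = 0, cols = c("count.x", "count.y"))  # replace NAs with 0 by reference
  DT[, countdif := count.y - count.x]
  DT[, c("count.x", "count.y") := NULL]
  return(DT)
}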
And the microbenchmark:
library(microbenchmark)
times <- microbenchmark(dodplyr(df1, df2), dodata.table(df1, df2), times=1e3)
> times
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
      dodplyr(df1, df2) 2.374164 2.489710 2.978814 2.590829 2.704017 18.15356  1000
 dodata.table(df1, df2) 5.094271 5.308994 6.458764 5.534259 5.675328 37.23370  1000
HOWEVER, I could not compare them with my original implementation using merge() from base R on the data.frames. I tried to include it, but I get an error when calling microbenchmark. Here is what I tried:
domerge <- function(a, b){
  dfm <- merge(a, b, by.x=c("x","y","z"), by.y=c("x","y","z"), all.x=TRUE, all.y=TRUE)
  dfm[is.na(dfm$count.x), 4] <- 0
  dfm[is.na(dfm$count.y), 5] <- 0
  dfm$countdif <- dfm$count.y - dfm$count.x
  dfm <- dfm[, c(1:3, 6)]
  return(dfm)
}
That works when calling it directly, e.g. df3 <- domerge(df1, df2), but it gives an error when microbenchmarking:
> times <- microbenchmark(domerge(df1,df2), dodplyr(df1,df2), dodata.table(df1,df2), times=1e3)
Error in merge.data.table(a, b, by.x = c("x", "y", "z"), by.y = c("x", :
Can not match keys in x and y to automatically determine appropriate `by` parameter. Please set `by` value explicitly.
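My guess (not verified): setDT() converts df1 and df2 to data.tables by reference, so once dodata.table() has run, the merge() inside domerge() dispatches to merge.data.table and fails with the error above. Converting with as.data.table() inside the function (which copies instead of modifying the caller's objects) might avoid this; a sketch with the placeholder name dodata.table2:
library(data.table)
# sketch; dodata.table2 is a placeholder name
dodata.table2 <- function(a, b){
  a <- as.data.table(a)  # copies, so df1/df2 stay data.frames in the caller
  b <- as.data.table(b)
  DT <- merge(a, b, by = c("x", "y", "z"), all = TRUE)
  for(j in 4:5){ set(DT, i = which(is.na(DT[[j]])), j = j, value = 0) }
  DT[, countdif := count.y - count.x]
  DT[, c("count.x", "count.y") := NULL]
  return(DT)
}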
I am guessing the full_join from dplyr would be faster compared to merge (not tested though). After we are done with the full_join, the NAs in the 'count' columns are replaced by 0 using mutate_each, and then we create the 'countdif' column using mutate:
library(dplyr)
dfN <- full_join(df1,df2, by=c('x', 'y', 'z')) %>%
mutate_each(funs(replace(., which(is.na(.)), 0)),
starts_with('count')) %>%
mutate(countdif= count.y-count.x)
dfN
# x y z count.x count.y countdif
#1 3 8 13 10 0 -10
#2 5 5 15 20 0 -20
#3 2 7 12 4 0 -4
#4 4 6 10 12 4 -8
#5 3 9 13 0 7 7
#6 6 8 15 0 3 3
Or a possible approach using merge.data.table from data.table. We convert both data.frames to data.tables (setDT(df1), setDT(df2)). Then we use merge.data.table to do the full join. We replace the NA values in the 'count' columns with 0 (here the 4th and 5th columns) in a for loop with set (set is very efficient as it doesn't have the [.data.table overhead). We assign (:=) a new column 'countdif' as the difference between 'count.y' and 'count.x':
library(data.table)
setDT(df1)
setDT(df2)
DT <- merge(df1, df2, by=c('x', 'y', 'z'), all=TRUE)
for(j in 4:5){
set(DT, i=which(is.na(DT[[j]])), j=j, value=0)
}
DT[, countdif:= count.y-count.x]
DT
# x y z count.x count.y countdif
#1: 2 7 12 4 0 -4
#2: 3 8 13 10 0 -10
#3: 3 9 13 0 7 7
#4: 4 6 10 12 4 -8
#5: 5 5 15 20 0 -20
#6: 6 8 15 0 3 3
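Another possible direction (just a sketch, not benchmarked, and not part of the approaches above): since 'countdif' is count.y - count.x, the full join could be skipped by stacking df2 together with a sign-flipped df1 and summing the counts per (x, y, z) group:
library(data.table)
setDT(df1)
setDT(df2)
# stack df1 (with negated counts) under df2, then sum per (x, y, z)
DT2 <- rbindlist(list(df1[, .(x, y, z, count = -count)],
                      df2[, .(x, y, z, count)]))
DT2 <- DT2[, .(countdif = sum(count)), by = .(x, y, z)]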