I'm relatively new to R and haven't been able to find anywhere that answers this query.
I have 2 dataframes with the same number of rows but different numbers of columns. I want to subtract the values in matching columns to identify the difference between the 2 dataframes.
For example, these 2 dataframes resemble the ones I am working with:
df1<-data.frame(Measure=rep("test",3),Filename=c("filename1","filename2","filename3"),Op1=c(79,72,95),Op2=c(NA,NA,5),Op3=c(75,64,66),Op4=c(86,71,58))
Filename Op1 Op2 Op3 Op4
1 filename1 79 NA 75 86
2 filename2 72 NA 64 71
3 filename3 95 5 66 58
df2<-data.frame(Measure=rep("test",3),Filename=c("filename1","filename2","filename3"),Op1=c(9,NA,5),Op4=c(80,70,50))
Filename Op1 Op4
1 filename1 9 80
2 filename2 NA 70
3 filename3 5 50
Currently I have a function that melts the 2 data frames and sums the data that looks like this:
CalcFunSum<-function(MeasureName,BoxNumbers){
temp<-data.frame()
for (i in BoxNumbers){
data<-melt(BoxNumbers[i])
temp<-temp %>% bind_rows(data)
}
temp<-cbind(Measure = MeasureName,dcast(temp,Filename~variable,sum,fill = 0))
temp
}
So CalcFunSum(test,c(df1,df2))
will add the 2 data frames together and produce
Measure Filename Op1 Op2 Op3 Op4
1 test filename1 88 NA 75 166
2 test filename2 72 NA 64 141
3 test filename3 100 5 66 108
What I want is something similar to perform the calculation df1-df2 to get:
Measure Filename Op1 Op2 Op3 Op4
1 test filename1 70 NA 75 6
2 test filename2 72 NA 64 1
3 test filename3 90 5 66 8
I've tried replacing the sum
in the function with diff
but that didn't work
Any ideas how to go about this?
edit - I realised the function included reference to a list I hold these data frames in and changed it.
Follow up: dealing with NA values
So the answers so far work, but on testing with my actual data I have noticed that in cases where df2 has an NA but df1 has a value the resulting output contains NA and not the value in df1. I'm going to change one of the values in df2 to NA to reflect this.
In the current answers by @akrun and @IceCreamToucan the output would be
Measure Filename Op1 Op2 Op3 Op4
1 test filename1 70 NA 75 6
2 test filename2 NA NA 64 1
3 test filename3 90 5 66 8
I presume this is an NA.rm = T somewhere in the code, or I need to deal with the NA values earlier in the process, but it would be useful to know if there are tweaks to the answers that could sort this out.
Here is one option with a join using data,table
. Get the column names common in both dataset (intersect
) and remove the names that are not needed for comparison (setdiff
)
library(data.table)
nm1 <- setdiff(intersect(names(df1), names(df2)), c("Measure", "Filename"))
Then do a join on
the 'Measure', 'Filename', get the values of the columns ('nm1') from 'df1' and the corresponding columns in 'df2'. Here, it would be i.
as 'df2' is in the i
th position (a data.table follows the same format [i, j, by]
. By using mget
, it returns the columns in a list
, we get the difference (-
) of both sets of columns with Map
and update by assigning (:=
) the values which would reflect in the original dataset ('df1')
setDT(df1)[df2, (nm1) := Map(`-`, mget(nm1),
mget(paste0("i.", nm1))), on = .(Measure, Filename)]
df1
# Measure Filename Op1 Op2 Op3 Op4
#1: test filename1 70 NA 75 6
#2: test filename2 70 NA 64 1
#3: test filename3 90 5 66 8
Any value compared with NA returns NA, Similarly,
72-NA
#[1] NA
To avoid this issue, we can replace
the NA
with 0 and then do the difference
setDT(df1)[df2, (nm1) := Map(function(x, y) replace(x, is.na(x), 0) -
replace(y, is.na(y), 0),
mget(nm1),mget(paste0("i.", nm1))), on = .(Measure, Filename)]
df1
# Measure Filename Op1 Op2 Op3 Op4
#1: test filename1 70 NA 75 6
#2: test filename2 72 NA 64 1
#3: test filename3 90 5 66 8