Search code examples
rdataframedplyrdifferencesubtraction

Subtract/find the difference between 2 data frames with different numbers of columns


I'm relatively new to R and haven't been able to find anywhere that answers this query.

I have 2 dataframes with the same number of rows but different numbers of columns. I want to subtract the values in matching columns to identify the difference between the 2 dataframes.

For example, these 2 dataframes resemble the ones I am working with:

df1<-data.frame(Measure=rep("test",3),Filename=c("filename1","filename2","filename3"),Op1=c(79,72,95),Op2=c(NA,NA,5),Op3=c(75,64,66),Op4=c(86,71,58))

   Filename  Op1  Op2  Op3  Op4
1  filename1  79   NA   75   86
2  filename2  72   NA   64   71
3  filename3  95   5    66   58

df2<-data.frame(Measure=rep("test",3),Filename=c("filename1","filename2","filename3"),Op1=c(9,NA,5),Op4=c(80,70,50))

   Filename  Op1  Op4
1  filename1   9   80
2  filename2   NA  70
3  filename3   5   50

Currently I have a function that melts the 2 data frames and sums the data that looks like this:

CalcFunSum<-function(MeasureName,BoxNumbers){
  temp<-data.frame()
  for (i in BoxNumbers){
    data<-melt(BoxNumbers[i])
    temp<-temp %>% bind_rows(data)
  }
  temp<-cbind(Measure = MeasureName,dcast(temp,Filename~variable,sum,fill = 0))
  temp
}

So CalcFunSum(test,c(df1,df2)) will add the 2 data frames together and produce

  Measure  Filename  Op1  Op2  Op3  Op4
1  test   filename1  88   NA   75   166
2  test   filename2  72   NA   64   141
3  test   filename3  100  5    66   108

What I want is something similar to perform the calculation df1-df2 to get:

  Measure  Filename  Op1  Op2  Op3  Op4
1  test   filename1  70   NA   75   6
2  test   filename2  72   NA   64   1
3  test   filename3  90   5    66   8

I've tried replacing the sum in the function with diff but that didn't work

Any ideas how to go about this?

edit - I realised the function included reference to a list I hold these data frames in and changed it.

Follow up: dealing with NA values

So the answers so far work, but on testing with my actual data I have noticed that in cases where df2 has an NA but df1 has a value the resulting output contains NA and not the value in df1. I'm going to change one of the values in df2 to NA to reflect this.

In the current answers by @akrun and @IceCreamToucan the output would be

  Measure  Filename  Op1  Op2  Op3  Op4
1  test   filename1  70   NA   75   6
2  test   filename2  NA   NA   64   1
3  test   filename3  90   5    66   8

I presume this is an NA.rm = T somewhere in the code, or I need to deal with the NA values earlier in the process, but it would be useful to know if there are tweaks to the answers that could sort this out.


Solution

  • Here is one option with a join using data,table. Get the column names common in both dataset (intersect) and remove the names that are not needed for comparison (setdiff)

    library(data.table)
    nm1 <- setdiff(intersect(names(df1), names(df2)), c("Measure", "Filename"))
    

    Then do a join on the 'Measure', 'Filename', get the values of the columns ('nm1') from 'df1' and the corresponding columns in 'df2'. Here, it would be i. as 'df2' is in the ith position (a data.table follows the same format [i, j, by]. By using mget, it returns the columns in a list, we get the difference (-) of both sets of columns with Map and update by assigning (:=) the values which would reflect in the original dataset ('df1')

    setDT(df1)[df2, (nm1) := Map(`-`, mget(nm1),
               mget(paste0("i.", nm1))), on = .(Measure, Filename)]
    df1
    #   Measure  Filename Op1 Op2 Op3 Op4
    #1:    test filename1  70  NA  75   6
    #2:    test filename2  70  NA  64   1
    #3:    test filename3  90   5  66   8
    

    Update

    Any value compared with NA returns NA, Similarly,

    72-NA
    #[1] NA
    

    To avoid this issue, we can replace the NA with 0 and then do the difference

    setDT(df1)[df2, (nm1) := Map(function(x, y) replace(x, is.na(x), 0) - 
         replace(y, is.na(y), 0), 
        mget(nm1),mget(paste0("i.", nm1))), on = .(Measure, Filename)]
    
    df1
    #   Measure  Filename Op1 Op2 Op3 Op4
    #1:    test filename1  70  NA  75   6
    #2:    test filename2  72  NA  64   1
    #3:    test filename3  90   5  66   8