Search code examples
rdataframesubtraction

Subtracting data.frames and data.tables of different sizes


Two data.frames of same dimensions can be subtracted in R using

df1 - df2

But I want to subtract two data.frames of different dimensions like

df1 <- data.frame(V1=1:5)
df2 <- data.frame(V1=1:5, V2=6:10)

df1-df2

Error in Ops.data.frame(df1, df2) : 
  ‘-’ only defined for equally-sized data frames

This subtracting can be achieved using for loop but I'm looking for any already established function. Thanks

Edited

How about if have to subtract two data.tables of different dimensions?

library(data.table)
dt1 <- data.table(V1=1:5)
dt2 <- data.table(V1=1:5, V2=6:10)

dt1-dt2

Error in `[.data.table`(dt1, row(dt2), ) : 
  i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #1611.
dt1[row(dt2),]-dt2

Solution

  • We could do this by making both the datasets have the same length so that we can compare element-by-element of each dataset. In the example given 'df1' has 1 column with 5 rows, where as for 'df2' it is 2 columns with 5 rows. The idea would be to make 'df1' having 10 elements or either 2 columns with 5 rows to match the dimensions of 'df2'. This can be easily done by rep or a convenient function is row.

      df1[row(df2),]-df2
    

    Just to make it more clear

     row(df2)
     #     [,1] [,2]
     #[1,]    1    1
     #[2,]    2    2
     #[3,]    3    3
     #[4,]    4    4
     #[5,]    5    5
    

    gives the row index for each row of 'df2'. By doing

     df1[row(df2),]
     #[1] 1 2 3 4 5 1 2 3 4 5
    

    we replicate each row element twice. Given that the datasets do this in columnwise, it is happening like below

     df1[c(row(df2)[,1],row(df2)[,2]),]
    

    This can be subtracted from df2

     df1[row(df2),]-df2
     #  V1 V2
     #1  0 -5
     #2  0 -5
     #3  0 -5
     #4  0 -5
     #5  0 -5
    

    As @David Arenburg mentioned, this would return wrong results for both datasets having multiple columns. So, if you are going to subtract a single column from 'df1' (having multiple columns) from a multiple column dataset ('df2'), then selecting that column and subtracting from 'df2' may be more general (Thanks to @David Arenburg's code)

     df1$V1-df2
     #  V1 V2
     #1  0 -5
     #2  0 -5
     #3  0 -5
     #4  0 -5
     #5  0 -5
    

    It works due to the recycling effect, i.e. 'V1' column elements will subtract from the first column of 'df2', then it will again start from the first element to start subtracting from the second column of 'df2' (assuming that both datasets have the same number or rows).


    For the second example with data.table (single column for 'dt1'), one option would be

    dt1[,rep(names(dt1), ncol(dt2)),with=FALSE]-dt2
    #   V1 V1
    #1:  0 -5
    #2:  0 -5
    #3:  0 -5
    #4:  0 -5
    #5:  0 -5