How to keep the ID of a dataset when calculating the difference between two datasets within a function in R


I have a function that calculates the row-wise difference between two datasets over the columns they share. I want to keep the ID after the calculation because I will need it later to merge the result with another table, and I have no idea how to do this step. Here are the data and the function.

# data frame for recipients
IDr = seq(1, 4)
Blood_type_r = c("A", "B", "AB", "O")
data_R = data.frame(IDr, Blood_type_r, A = rep(0, 4), B = c(rep(0, 3), 1), C = c(rep(1, 3), 0), D = rep(1, 4), E = c(rep(0, 2), 1, 0), stringsAsFactors = FALSE)

data_R
#   IDr Blood_type_r A B C D E
# 1   1            A 0 0 1 1 0
# 2   2            B 0 0 1 1 0
# 3   3           AB 0 0 1 1 1
# 4   4            O 0 1 0 1 0

# data frame for donors
IDd = seq(1, 8)
Blood_type_d = c(rep("A", each = 2), rep("B", each = 2), rep("AB", each = 2), rep("O", each = 2))
WD = c(rep(0.25, each = 2), rep(0.125, each = 2), rep(0.125, each = 2), rep(0.5, each = 2))
data_D = data.frame(IDd, Blood_type_d, A = c(rep(0, 6), 1, 1), B = c(rep(0, 6), 1, 1), C = c(rep(1, 7), 0), D = rep(1, 8), E = c(rep(0, 6), 1, 0), WD, stringsAsFactors = FALSE)

data_D
#   IDd Blood_type_d A B C D E    WD
# 1   1            A 0 0 1 1 0 0.250
# 2   2            A 0 0 1 1 0 0.250
# 3   3            B 0 0 1 1 0 0.125
# 4   4            B 0 0 1 1 0 0.125
# 5   5           AB 0 0 1 1 0 0.125
# 6   6           AB 0 0 1 1 0 0.125
# 7   7            O 1 1 1 1 1 0.500
# 8   8            O 1 1 0 1 0 0.500

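# function
library(purrr)  # map2_df() comes from the purrr package

soustraction.i = function(D, R, i, threshold) {
  D = as.data.frame(D)
  R = as.data.frame(R)
  # subtract row i of R from every row of D, one column at a time
  dif = map2_df(D, R[i, ], `-`)
  # negative differences do not count as mismatches
  dif[dif < 0] = 0
  dif$mismatch = rowSums(dif)
  # keep only the rows at or below the threshold
  dif = dif[which(dif$mismatch <= threshold), ]
  return(dif)
}

soustraction.i(data_D[, 3:7], data_R[, 3:7], 1, 3)
# A tibble: 8 x 6
#       A     B     C     D     E mismatch
#   <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
# 1     0     0     0     0     0        0
# 2     0     0     0     0     0        0
# 3     0     0     0     0     0        0
# 4     0     0     0     0     0        0
# 5     0     0     0     0     0        0
# 6     0     0     0     0     0        0
# 7     1     1     0     0     1        3
# 8     1     1     0     0     0        2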

I would like the output to look like the following, keeping IDd from the donor table, but I do not know how to do it because the two datasets must have the same number of columns when I pass them as arguments. For example, with the threshold set to 3, I should get every IDd from the donor table.

    IDd    A     B     C     D     E mismatch
1   1      0     0     0     0     0        0
2   2      0     0     0     0     0        0
3   3      0     0     0     0     0        0
4   4      0     0     0     0     0        0
5   5      0     0     0     0     0        0
6   6      0     0     0     0     0        0
7   7      1     1     0     0     1        3
8   8      1     1     0     0     0        2

Any help is appreciated. Thank you.


Solution

  • To get the ID column in the output, you first have to pass it in with the input. Try this function:

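    soustraction.i = function(D, R, i, threshold) {
      D = as.data.frame(D)
      R = as.data.frame(R)
      # D[-1] drops the ID column so that D and R[i, ] have matching columns
      dif = purrr::map2_df(D[-1], R[i, ], `-`)
      dif[dif < 0] = 0
      dif$mismatch = rowSums(dif)
      # put the ID column back in front; it keeps its original name (IDd)
      dif = cbind(D[1], dif)
      dif = dif[which(dif$mismatch <= threshold), ]
      return(dif)
    }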
    
    soustraction.i(data_D[,c(1, 3:7)],data_R[,3:7],1,3)
    
    #  IDd A B C D E mismatch
    #1   1 0 0 0 0 0        0
    #2   2 0 0 0 0 0        0
    #3   3 0 0 0 0 0        0
    #4   4 0 0 0 0 0        0
    #5   5 0 0 0 0 0        0
    #6   6 0 0 0 0 0        0
    #7   7 1 1 0 0 1        3
    #8   8 1 1 0 0 0        2
    
    soustraction.i(data_D[,c(1, 3:7)],data_R[,3:7],1,2)
    #  IDd A B C D E mismatch
    #1   1 0 0 0 0 0        0
    #2   2 0 0 0 0 0        0
    #3   3 0 0 0 0 0        0
    #4   4 0 0 0 0 0        0
    #5   5 0 0 0 0 0        0
    #6   6 0 0 0 0 0        0
    #8   8 1 1 0 0 0        2
    

    Note that this assumes the ID column is the first column of the D argument (here IDd in data_D), since the function picks it by position with D[1] and D[-1].
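
If relying on column position feels fragile, the same idea can be sketched in base R by selecting the ID and the compared columns by name. This is only a sketch under assumptions: the helper name `mismatch_by_id` and its `id_col` and `cols` arguments are made up for illustration and are not part of the original function. It uses `sweep()` instead of `purrr::map2_df()`, and ends with the merge step mentioned in the question.

```r
# Recreate the example data from the question.
data_R <- data.frame(
  IDr = 1:4,
  Blood_type_r = c("A", "B", "AB", "O"),
  A = rep(0, 4), B = c(0, 0, 0, 1), C = c(1, 1, 1, 0),
  D = rep(1, 4), E = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)
data_D <- data.frame(
  IDd = 1:8,
  Blood_type_d = rep(c("A", "B", "AB", "O"), each = 2),
  A = c(rep(0, 6), 1, 1), B = c(rep(0, 6), 1, 1),
  C = c(rep(1, 7), 0), D = rep(1, 8), E = c(rep(0, 6), 1, 0),
  WD = rep(c(0.25, 0.125, 0.125, 0.5), each = 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `id_col` and `cols` are names chosen for this sketch.
mismatch_by_id <- function(D, R, i, threshold, id_col, cols) {
  # Subtract recipient row i from every donor row, column by column.
  dif <- sweep(as.matrix(D[, cols]), 2, unlist(R[i, cols]), `-`)
  # Negative differences do not count as mismatches.
  dif[dif < 0] <- 0
  # Put the ID column, selected by name, in front of the differences.
  out <- data.frame(D[id_col], dif, check.names = FALSE)
  out$mismatch <- rowSums(dif)
  out[out$mismatch <= threshold, , drop = FALSE]
}

res <- mismatch_by_id(data_D, data_R, i = 1, threshold = 2,
                      id_col = "IDd", cols = c("A", "B", "C", "D", "E"))

# Because IDd survives, the result can be merged back with other donor columns.
merged <- merge(res, data_D[, c("IDd", "Blood_type_d", "WD")], by = "IDd")
```

With a threshold of 2, `res` should hold donors 1 to 6 and 8, matching the second call in the answer above, and `merged` adds `Blood_type_d` and `WD` for those donors.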