Search code examples
rdplyrmergepurrrsplit-apply-combine

Calculate z-score from another dataframe when meeting conditions in R


I have two dataframes:

Node1<-c(1, 2, 3, 4, 5, 6, 7)
Node1<-c(1, 1, 1, 2, 2, 3)
Node2<-c(2, 3, 4, 3, 4, 4)
Mean<-c(2, 1, 5, 3, 2, 3)
SD<-c(0.1, 0.2, 0.01, 0.2, 0.01, 0.01)
df1<-data.frame(Node1, Node2, Mean, SD)
df1
  Node1 Node2 Mean   SD
1     1     2    2 0.10
2     1     3    1 0.20
3     1     4    5 0.01
4     2     3    3 0.20
5     2     4    2 0.01
6     3     4    3 0.01

df2<-data.frame(Node1, Node2, value)
df2
  Node1 Node2 value
1     1     2     1
2     1     3     2
3     1     4     1
4     2     3     4
5     2     4     6
6     3     4     7

For each "value" row in df2, I want to calculate the z-score using the Mean and standard deviation (SD) from the corresponding Node pair (i.e., Node1 and Node2) from df1 as the reference values. E.g., for the first row from df2, the z-score should be ((1-2)/0.10) as the corresponding mean and SD are derived by matching Node1 and Node2 values from both dfs.

How do I perform this in R? My current solutions are using left_join to combine df1 and df2, then use mutate and case_when but I'm unable to go beyond this to extract the z-scores while retaining specific conditions (i.e., do while matching Node1 and Node2 values from both dfs)


Solution

  • I think you can just do:

    left_join(df1, df2, by = c('Node1', 'Node2')) %>%
      mutate(z_score = (Mean - value) / SD)
    #>   Node1 Node2 Mean   SD value z_score
    #> 1     1     2    2 0.10     1      10
    #> 2     1     3    1 0.20     2      -5
    #> 3     1     4    5 0.01     1     400
    #> 4     2     3    3 0.20     4      -5
    #> 5     2     4    2 0.01     6    -400
    #> 6     3     4    3 0.01     7    -400