Search code examples
rdplyrnamedian

R: Calculate the column medians by grouping the ID's


Continuing from my previous post, Now I want to group by ID (only for Column 3) and calculate the median of the column (Point_B) and then subtract the median value with every value in the column (Point_B) to its respective group. NA's should still be returned.

Note: I want the ID grouping applied to only the Point_B column and not to Point_A as I want to calculate the median of the entire Point_A column and subtract it with the values in Point_A.

For example

ID <- c("A","A","A","B","B","B","C","C","C") 
Point_A <- c(1,2,NA,1,2,3,1,2,NA) 
Point_B <- c(1,2,3,NA,NA,1,1,1,3)

df <- data.frame(ID,Point_A ,Point_B)


+----+---------+---------+
| ID | Point_A | Point_B |
+----+---------+---------+
| A  | 1       | 1       |
| A  | 2       | 2       |
| A  | NA      | 3       |
| B  | 1       | NA      |
| B  | 2       | NA      |
| B  | 3       | 1       |
| C  | 1       | 1       |
| C  | 2       | 1       |
| C  | NA      | 3       |
+----+---------+---------+

The solution provided to my previous post calculates medians without grouping by ID. Here it is

library(dplyr)
 df %>%
     mutate_each(funs(median=.-median(., na.rm=TRUE)), -ID)

Desired Output

+----+---------+---------+
| ID | Point_A | Point_B |
+----+---------+---------+
| A  | -1      | -1      |
| A  | 0       | 0       |
| A  | NA      | 1       |
| B  | -1      | NA      |
| B  | 0       | NA      |
| B  | 1       | 0       |
| C  | -1      | 0       |
| C  | 0       | 0       |
| C  | NA      | 2       |
+----+---------+---------+

How do we get the values in Column3 with grouping by ID?


Solution

  • You'll want a group_by, I guess (following @docendodiscimus' suggestion):

    demed <- function(x) x-median(x,na.rm=TRUE)
    
    df %>% 
      mutate_each(funs(demed),Point_A) %>%
      group_by(ID) %>%  
      mutate_each(funs(demed),Point_B)
    

    giving

      ID Point_A Point_B
    1  A      -1      -1
    2  A       0       0
    3  A      NA       1
    4  B      -1      NA
    5  B       0      NA
    6  B       1       0
    7  C      -1       0
    8  C       0       0
    9  C      NA       2
    

    I prefer the analogous data.table code. Its syntax requires writing the variable names multiple times, but has far fewer parentheses:

    require(data.table)
    DT <- data.table(df)
    
    DT[,Point_A:=demed(Point_A)
    ][,Point_B:=demed(Point_B)
    ,by=ID]