Search code examples
rmedian

Median from multiple rows and columns in a data table with grouping


I have a data table with over 90000 observations and 1201 variables. All columns except the last one store numeric values, the last column is the column with names of source files (over 100). Here is a small sample of the data table:

library(data.table)
DT <- data.table(V1=sample(0:100,20,replace=TRUE), 
V2=sample(0:100,20,replace=TRUE), V3=sample(0:100,20,replace=TRUE), 
V4=sample(0:100,20,replace=TRUE), V5=sample(0:100,20,replace=TRUE), 
V6=sample(0:100,20,replace=TRUE), V7=sample(0:100,20,replace=TRUE), 
file=rep(c("A","B","C","D"), each = 5))

What I want to do is to calculate a median of ALL values in each group (file). So e.g. for group A the median would be calculated from rows 1,2,3,4,5 at once. In the next step, I would like to assign the medians to each of the rows depending on a group (expected output below).

The question seems to be simple, I have googled many similar questions regarding median/mean calculation depending on a group (aggregate as one of the most popular solutions). However, in all cases only one column is taken into account for the median calculation. Here are 7 (or in my original data 1200) and median does not accept that - I should provide a numerical vector. Therefore I have experimented with unlist, aggregate, dplyr package, tapply with any luck...

Due to the amount of data and groups (i.e. file) the code should be quite automatic and efficient... I would really appreciate your help!

Just a small example if the code which obviously has failed:

DT_median <- setDT(DT)[, DT_med := median(DT[,1:7]), by = file]

The expected result should look like this:

V1  V2  V3  V4  V5  V6  V7  file DT_med
42  78  9   0   60  46  65  A    37.5
36  36  46  45  5   96  64  A    37.5
83  31  92  100 15  2   9   A    37.5
36  16  49  82  32  4   46  A    37.5
29  17  39  6   62  52  97  A    37.5
37  70  17  90  8   10  93  B    47
72  62  68  83  96  77  20  B    47
10  47  29  2   93  16  30  B    47
69  87  7   47  96  17  8   B    47
23  70  72  27  10  86  49  B    47
78  51  13  33  56  6   39  C    51
28  92  100 5   75  33  17  C    51
71  82  9   20  34  83  22  C    51
62  40  84  87  37  45  34  C    51
55  80  55  94  66  96  12  C    51
93  1   99  97  7   77  6   D    41
53  55  71  12  19  25  28  D    41
27  25  28  89  41  22  60  D    41
91  25  25  57  21  98  27  D    41
2   63  17  53  99  65  95  D    41

Solution

  • As we want to calculate the median from all the values, grouped by 'file', unlist the Subset of Data.table (.SD), get the median and assign (:=) the output to create the new column 'DT_med'

    library(data.table)
    DT[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file]