Search code examples
rdata.tableaggregateexpand

How to expand/aggregate a data.table while including the existing row values?


I have the following R data.table:

 library(data.table)
 dt =

    unique_point biased    data_points   team   groupID                                                                                                           
 1: up1          FALSE     3             1      xy28352                                                                                                                 
 2: up1          TRUE      4             22     xy28352                                                                                                                 
 3: up2          FALSE     1             4      xy28352                                                                                                                  
 4: up2          TRUE      0             3      xy28352                                                                                                                  
 5: up3          FALSE     12            5      xy28352                                                                                                                 
 6: up3          TRUE      35            7      xy28352 
 ....  

I've formatted the data.table such that for each unique_point, I am measuring the data points for unbiased and biased. So each unique_point has two rows, biased FALSE and biased TRUE. If there are no measurements, this is recorded as 0.

As an example, for up1, there are 3 data points for the unbiased experiment, and 4 data points for the biased experiment.

Each groupID has 25 teams, each with potentially with a measurement for biased and unbiased. I would like to re-format the data.table so it calculates the number of data points by team as well, for each unique data points (due to the data, this will make rows have data_points of 0).

    unique_point biased    data_points   team   groupID                                                                                                           
 1: up1          FALSE     3             1      xy28352                                                                                                                 
 2: up1          TRUE      0             1      xy28352                                                                                                                 
 3: up1          FALSE     0             2      xy28352                                                                                                                  
 4: up1          TRUE      0             2      xy28352                                                                                                                  
 5: up1          FALSE     0             3      xy28352                                                                                                                 
 6: up1          TRUE      0             3      xy28352 
 ....  
 45. up1         TRUE      4            22      xy28352 
 ....
 49. up1         FALSE     0            25      xy28352 
 50. up1         TRUE      0            25      xy28352 

This task is very close to somehow "unfolding" the data.table. For each unique_point, I would create 50 rows, 25 teams with TRUE and FALSE. The added complication is that I need to use the counts above to fill in the above with the counts.

There should be a way to use unique() to count the times the rows exist possibly?

If I try

setkey(dt, team, unique_point)[CJ(unique(unique_point), unique(team)), .N, by=.EACHI]

I am counting the number of rows which occur for unique_point and team. But this wouldn't keep the data_points.


Solution

  • Using:

    DT2 <- DT[, .SD[CJ(team = 1:25, biased = biased, unique = TRUE), on = .(biased, team)], by = .(unique_point, groupID)
              ][is.na(data_points), data_points := 0][]
    setcolorder(DT2, c(1,3:5,2))
    

    gives:

    > DT2
         unique_point biased data_points team groupID
      1:          up1  FALSE           3    1 xy28352
      2:          up1   TRUE           0    1 xy28352
      3:          up1  FALSE           0    2 xy28352
      4:          up1   TRUE           0    2 xy28352
      5:          up1  FALSE           0    3 xy28352
     ---                                             
    146:          up3   TRUE           0   23 xy28352
    147:          up3  FALSE           0   24 xy28352
    148:          up3   TRUE           0   24 xy28352
    149:          up3  FALSE           0   25 xy28352
    150:          up3   TRUE           0   25 xy28352
    

    What this does:

    • You group DT by unique_point and groupID with by = .(unique_point, groupID)
    • The remaining columns are joined with a full reference table (CJ(team = 1:25, biased = biased)) for biased and team.
    • The expanded dataset will have NA-values for the rows that are not present in DT. Therefore you fill them with zero's with the [is.na(data_points), data_points := 0] part.
    • The last pair of square brackets ([]) aren't necessary, but makes printing on the console requiring a step less extra. For more info, see here.

    Using setcolorder(DT2, c(1,3:5,2)) isn't necessary & only necessary if you want to get the exact same column order as described in the question.

    As an alternative, you could also use:

    DT2 <- DT[CJ(unique_point = unique_point, biased = biased, team = 1:25, groupID = groupID, unique = TRUE), 
              on = .(unique_point, biased, team, groupID)
              ][is.na(data_points), data_points := 0][]
    

    The full first 60 rows:

    > DT2[1:60]
        unique_point biased data_points team groupID
     1:          up1  FALSE           3    1 xy28352
     2:          up1   TRUE           0    1 xy28352
     3:          up1  FALSE           0    2 xy28352
     4:          up1   TRUE           0    2 xy28352
     5:          up1  FALSE           0    3 xy28352
     6:          up1   TRUE           0    3 xy28352
     7:          up1  FALSE           0    4 xy28352
     8:          up1   TRUE           0    4 xy28352
     9:          up1  FALSE           0    5 xy28352
    10:          up1   TRUE           0    5 xy28352
    11:          up1  FALSE           0    6 xy28352
    12:          up1   TRUE           0    6 xy28352
    13:          up1  FALSE           0    7 xy28352
    14:          up1   TRUE           0    7 xy28352
    15:          up1  FALSE           0    8 xy28352
    16:          up1   TRUE           0    8 xy28352
    17:          up1  FALSE           0    9 xy28352
    18:          up1   TRUE           0    9 xy28352
    19:          up1  FALSE           0   10 xy28352
    20:          up1   TRUE           0   10 xy28352
    21:          up1  FALSE           0   11 xy28352
    22:          up1   TRUE           0   11 xy28352
    23:          up1  FALSE           0   12 xy28352
    24:          up1   TRUE           0   12 xy28352
    25:          up1  FALSE           0   13 xy28352
    26:          up1   TRUE           0   13 xy28352
    27:          up1  FALSE           0   14 xy28352
    28:          up1   TRUE           0   14 xy28352
    29:          up1  FALSE           0   15 xy28352
    30:          up1   TRUE           0   15 xy28352
    31:          up1  FALSE           0   16 xy28352
    32:          up1   TRUE           0   16 xy28352
    33:          up1  FALSE           0   17 xy28352
    34:          up1   TRUE           0   17 xy28352
    35:          up1  FALSE           0   18 xy28352
    36:          up1   TRUE           0   18 xy28352
    37:          up1  FALSE           0   19 xy28352
    38:          up1   TRUE           0   19 xy28352
    39:          up1  FALSE           0   20 xy28352
    40:          up1   TRUE           0   20 xy28352
    41:          up1  FALSE           0   21 xy28352
    42:          up1   TRUE           0   21 xy28352
    43:          up1  FALSE           0   22 xy28352
    44:          up1   TRUE           4   22 xy28352
    45:          up1  FALSE           0   23 xy28352
    46:          up1   TRUE           0   23 xy28352
    47:          up1  FALSE           0   24 xy28352
    48:          up1   TRUE           0   24 xy28352
    49:          up1  FALSE           0   25 xy28352
    50:          up1   TRUE           0   25 xy28352
    51:          up2  FALSE           0    1 xy28352
    52:          up2   TRUE           0    1 xy28352
    53:          up2  FALSE           0    2 xy28352
    54:          up2   TRUE           0    2 xy28352
    55:          up2  FALSE           0    3 xy28352
    56:          up2   TRUE           0    3 xy28352
    57:          up2  FALSE           1    4 xy28352
    58:          up2   TRUE           0    4 xy28352
    59:          up2  FALSE           0    5 xy28352
    60:          up2   TRUE           0    5 xy28352
    

    Used data:

    DT <- fread('unique_point biased    data_points   team   groupID                                                                                                           
     up1          FALSE     3             1      xy28352                                                                                                                 
     up1          TRUE      4             22     xy28352                                                                                                                 
     up2          FALSE     1             4      xy28352                                                                                                                  
     up2          TRUE      0             3      xy28352                                                                                                                  
     up3          FALSE     12            5      xy28352                                                                                                                 
     up3          TRUE      35            7      xy28352')