Search code examples
rdata.tablebioconductor

Releveling factor to facilitate use as nested factor in DESeq2 model in R


I am fitting a GLM using the DESeq2 package, and have the situation where individuals (RatIDs) are nested within the treatment (Diet). The author of the package suggests that the individuals be re-leveled from 1:N within each Diet (where N is the number of RatIDs within a specific Diet) rather than their original ID/factor level (DESeq2 vignette, page 35.)

The data looks something like this (there are actually more columns and rows, but omitted for simplicity):

     Diet Extraction RatID
199 HAMSP          8    65
74   HAMS          9   108
308  HAMS         18   100
41  HAMSA          3    83
88  HAMSP         12    11
221 HAMSP         14    66
200 HAMSA          8    57
155 HAMSB          1   105
245 HAMSB         19    50
254  HAMS         21    90
182 HAMSB          4     4
283 HAMSA         23    59
180 HAMSP          4    22
71  HAMSP          9   112
212  HAMS         12    63
220 HAMSP         14    54
56   HAMS          7    81
274 HAMSP          1    11
114  HAMS         17   102
143 HAMSP         22    93

And here is a dput() output for the structure:

data = structure(list(Diet = structure(c(4L, 1L, 1L, 2L, 4L, 4L, 2L, 
        3L, 3L, 1L, 3L, 2L, 4L, 4L, 1L, 4L, 1L, 4L, 1L, 4L), .Label = c("HAMS", 
        "HAMSA", "HAMSB", "HAMSP", "LAMS"), class = "factor"), Extraction = c(8L, 
        9L, 18L, 3L, 12L, 14L, 8L, 1L, 19L, 21L, 4L, 23L, 4L, 9L, 12L, 
        14L, 7L, 1L, 17L, 22L), RatID = structure(c(61L, 7L, 3L, 76L, 
        9L, 62L, 52L, 6L, 46L, 81L, 37L, 54L, 20L, 12L, 59L, 50L, 74L, 
        9L, 4L, 84L), .Label = c("1", "10", "100", "102", "103", "105", 
        "108", "109", "11", "110", "111", "112", "113", "13", "14", "16", 
        "17", "18", "20", "22", "23", "24", "25", "26", "27", "28", "29", 
        "3", "30", "31", "32", "34", "35", "36", "37", "39", "4", "40", 
        "42", "43", "45", "46", "48", "49", "5", "50", "51", "52", "53", 
        "54", "55", "57", "58", "59", "6", "60", "61", "62", "63", "64", 
        "65", "66", "67", "68", "69", "70", "71", "73", "77", "78", "79", 
        "8", "80", "81", "82", "83", "85", "86", "88", "89", "90", "91", 
        "92", "93", "94", "95", "96", "98", "99"), class = "factor")), .Names = c("Diet", 
        "Extraction", "RatID"), row.names = c(199L, 74L, 308L, 41L, 88L, 
        221L, 200L, 155L, 245L, 254L, 182L, 283L, 180L, 71L, 212L, 220L, 
        56L, 274L, 114L, 143L), class = "data.frame")

Can someone please specify an elegant way to generate the new factor levels for RatIDs within Diet as an additional column of the above data.frame. Could this be done with the roll function of data.table?

Desired output (done manually):

    Diet Extraction RatID newCol
1  HAMSP          8    65      1
2   HAMS          9   108      1
3   HAMS         18   100      2
4  HAMSA          3    83      1
5  HAMSP         12    11      2
6  HAMSP         14    66      3
7  HAMSA          8    57      2
8  HAMSB          1   105      1
9  HAMSB         19    50      2
10  HAMS         21    90      3
11 HAMSB          4     4      3
12 HAMSA         23    59      3
13 HAMSP          4    22      4
14 HAMSP          9   112      5
15  HAMS         12    63      4
16 HAMSP         14    54      6
17  HAMS          7    81      5
18 HAMSP          1    11      2
19  HAMS         17   102      6
20 HAMSP         22    93      7

NOTE: There are not an equal number of Rats in each treatment. I'd also like the solution to not re-order the rows in the data (if possible).

EDIT: There is no 'natural' order to the RatIDs, just as long as there is a 1:1 mapping within a diet, its fine.


Solution

  • You can convert the 'RatID' to 'factor' and coerce it back to 'numeric'

     library(data.table)#v1.9.4+
     setDT(data)[, newCol:=as.numeric(factor(RatID, 
                           levels=unique(RatID))), Diet]
     #      Diet Extraction RatID newCol
     # 1: HAMSP          8    65      1
     # 2:  HAMS          9   108      1
     # 3:  HAMS         18   100      2
     # 4: HAMSA          3    83      1
     # 5: HAMSP         12    11      2
     # 6: HAMSP         14    66      3
     # 7: HAMSA          8    57      2
     # 8: HAMSB          1   105      1
     # 9: HAMSB         19    50      2
     #10:  HAMS         21    90      3
     #11: HAMSB          4     4      3
     #12: HAMSA         23    59      3
     #13: HAMSP          4    22      4
     #14: HAMSP          9   112      5
     #15:  HAMS         12    63      4
     #16: HAMSP         14    54      6
     #17:  HAMS          7    81      5
     #18: HAMSP          1    11      2
     #19:  HAMS         17   102      6
     #20: HAMSP         22    93      7
    

    Or use match

     setDT(data)[, newCol:=match(RatID, unique(RatID)), Diet]
    

    Or similar option with base R

    data$newCol <- with(data, ave(as.numeric(levels(RatID))[RatID],
           Diet, FUN=function(x) match(x, unique(x))))