I am working with a longitudinal data set with repeated observations for subjects in a long format data.table. Most subjects have a few (<10) repeated observations, while a few subjects have many (>100) observations. I can convert this data set from long to wide as below, but it becomes extremely wide (I have many variables at each time point) and is mostly full of NAs, since most subjects don't have data for variables at times 11 through 100. Is there a more elegant way to recast this data to wide format? I'm thinking something along the lines of a ragged array in other languages...
Some solutions exist here, but a big concern for me is object size: the wide matrix with lots of NAs takes up a lot of unnecessary space.
A MWE with my current (undesirably sparse matrix) solution is below. Ideally, if some sort of ragged list approach is feasible, the resulting object would have 3 rows and 3 columns, where the "year" and "code" columns are lists or similar. As a bonus, it would be wonderful if I could nest the "code" variable inside the "year" variable as nested ragged arrays.
library(data.table)
dat <- data.table(id=c(rep(1,5), rep(2,10), rep(3,85)),
year=sample(2013:2016, 100, replace=TRUE),
code=sample(LETTERS, 100, replace=TRUE))
wideDat <- dcast(dat, id~paste0("code", dat[,seq_len(.N), by=id]$V1),
value.var="code")
A few ideas
object.size(wideDat)
# 22432 bytes
# the following structures leverages the fact that years are missing
wideDat2 <- dcast(dat, id+year~code)
# id year A B C D E F G I J K L M N O P Q R S T U V W X Y Z
#1: 1 2014 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
#2: 1 2015 0 0 1 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#3: 2 2013 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#4: 2 2014 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
#5: 2 2015 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#6: 2 2016 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
#7: 3 2013 1 0 0 1 0 1 1 2 2 0 1 1 2 1 1 2 3 0 1 0 0 2 0 0 1
#8: 3 2014 1 2 0 0 2 1 0 3 0 0 3 0 0 0 3 1 0 2 1 1 0 2 0 0 2
#9: 3 2015 0 2 1 0 0 0 0 2 2 1 1 0 0 0 1 0 3 1 2 1 2 1 1 0 0
#10: 3 2016 1 0 0 2 0 1 0 0 0 1 0 2 1 2 1 1 1 0 1 0 1 1 0 1 0
object.size(wideDat2)
# 6872 bytes
## the following struture just compresses the codes as strings
library(dplyr)
wideDat3 <- dat %>%
group_by(id, year) %>%
arrange(id, year, code) %>%
summarize(codes = paste0(code, collapse=","))
# id year codes
<# dbl> <int> <chr>
#1 1 2014 P
#2 1 2015 C,J,L,L
#3 2 2013 B
#4 2 2014 S,W
#5 2 2015 A,A
#6 2 2016 B,G,K,O,S
#7 3 2013 A,D,F,G,I,I,J,J,L,M,N,N,O,P,Q,Q,R,R,R,T,W,W,Z
#8 3 2014 A,B,B,E,E,F,I,I,I,L,L,L,P,P,P,Q,S,S,T,U,W,W,Z,Z
#9 3 2015 B,B,C,I,I,J,J,K,L,P,R,R,R,S,T,T,U,V,V,W,X
#10 3 2016 A,D,D,F,K,M,M,N,O,O,P,Q,R,T,V,W,Y
object.size(wideDat3)
# 2856 bytes
## .. or as nested list
wideDat4 <- dat %>%
group_by(id, year) %>%
arrange(id, year, code) %>%
summarize(codes = list(code))
# id year codes
#<dbl> <int> <list>
# 1 1 2014 <chr [1]>
# 2 1 2015 <chr [4]>
# 3 2 2013 <chr [1]>
# 4 2 2014 <chr [2]>
# 5 2 2015 <chr [2]>
# 6 2 2016 <chr [5]>
# 7 3 2013 <chr [23]>
# 8 3 2014 <chr [24]>
# 9 3 2015 <chr [21]>
# 10 3 2016 <chr [17]>
object.size(widedat4)
# 6776 bytes