Search code examples
rmergedata.tabledata-munging

Avoiding merges and temporary data.tables: Refining my sloppy use of data.table code


I'm tabulating responses from a longitudinal study in which participants were asked to voluntarily respond to 4 surveys. Each participant has a unique PartID. Each participant is assigned a SectionID (letter). A survey that is attempted and completed is indicated by StatusID="Complete". A survey that is attempted and not completed is indicated by StatusID="Incomplete". A participant that does not attempt a survey will not have a record, but counts as "0" for that survey in the tabulation of results.

Input data example:

    PartID SectionID     Status SurveyID
 1:    100         A   Complete        1
 2:    100         A   Complete        2
 3:    100         A   Complete        3
 4:    100         A   Complete        4
 5:    101         B Incomplete        1
 6:    101         B   Complete        2
 7:    101         B   Complete        3
 8:    101         B   Complete        4
 9:    102         A Incomplete        1
10:    103         B Incomplete        4
11:    104         B Incomplete        2
12:    105         A   Complete        1
13:    105         A   Complete        1
14:    105         A   Complete        3

The following code works, but it's very sloppy. I'm assuming there is a cleaner more elegant way of using data.table to accomplish this data munging? In particular, I'd like to avoid the temporary variables, and the need to merge two data.tables.

library(data.table)
DT <- fread ("PartID,SectionID,Status,SurveyID
100,A,Complete,1
100,A,Complete,2
100,A,Complete,3
100,A,Complete,4
101,B,Incomplete,1
101,B,Complete,2
101,B,Complete,3
101,B,Complete,4
102,A,Incomplete,1
103,B,Incomplete,4
104,B,Incomplete,2
105,A,Complete,1
105,A,Complete,1
105,A,Complete,3\n")

setkey(DT, PartID)

DT2<-DT
setkey(DT2,PartID, SectionID)
DT2<-DT2[Status=="Complete",.(c1=sum(SurveyID==1),c2=sum(SurveyID==2),c3=sum(SurveyID==3), c4=sum(SurveyID==4)), by=.(PartID,SectionID)]
DT3<-DT
setkey(DT3,PartID, SectionID)
DT3<-DT3[Status=="Incomplete",.(i1=sum(SurveyID==1),i2=sum(SurveyID==2),i3=sum(SurveyID==3), i4=sum(SurveyID==4)), by=.(PartID,SectionID)]
DT4<-merge(DT2,DT3, all=TRUE )
DT4[is.na(DT4)] <- 0
DT4

The output that is achieved by the code above is correct, and is (note: c1 means Completed Survey #1, i1 means incomplete for survey #1. Also note that participants may submit more that one response per survey):

   PartID SectionID c1 c2 c3 c4 i1 i2 i3 i4
1:    100         A  1  1  1  1  0  0  0  0
2:    101         B  0  1  1  1  1  0  0  0
3:    102         A  0  0  0  0  1  0  0  0
4:    103         B  0  0  0  0  0  0  0  1
5:    104         B  0  0  0  0  0  1  0  0
6:    105         A  2  0  1  0  0  0  0  0

Solution

  • You could use dcast

    library(data.table)#v1.9.5+
    dcast(DT[, N :=.N,list(PartID, SectionID, SurveyID)][,
            Status1:= paste0(tolower(substr(Status,1,1)), SurveyID)], 
            PartID+SectionID~Status1, value.var='N', length)
    #   PartID SectionID c1 c2 c3 c4 i1 i2 i4
    #1:    100         A  1  1  1  1  0  0  0
    #2:    101         B  0  1  1  1  1  0  0
    #3:    102         A  0  0  0  0  1  0  0
    #4:    103         B  0  0  0  0  0  0  1
    #5:    104         B  0  0  0  0  0  1  0
    #6:    105         A  2  0  1  0  0  0  0
    

    If you need the i3

    DT1 <- DT[, N :=.N,list(PartID, SectionID, SurveyID)][,
        Status1:= paste0(tolower(substr(Status,1,1)), SurveyID)]
    DT2 <- data.table(Status1=paste0(rep(c('c', 'i'),each=4), 1:4))
     na.omit(dcast(setkey(DT1, Status1)[DT2], 
          PartID+SectionID~Status1, value.var='N', length))
     #   PartID SectionID c1 c2 c3 c4 i1 i2 i3 i4
     #1:    100         A  1  1  1  1  0  0  0  0
     #2:    101         B  0  1  1  1  1  0  0  0
     #3:    102         A  0  0  0  0  1  0  0  0
     #4:    103         B  0  0  0  0  0  0  0  1
     #5:    104         B  0  0  0  0  0  1  0  0
     #6:    105         A  2  0  1  0  0  0  0  0