Search code examples
rdata.tablerbinddcast

dcast() - adding a column that doesnt exist in R


I have encountered a problem, that i am sure has a easy solution, but i cannot find it. I basically summarise my table to get the sum of an value per level of a factor variable:

 NOdependants <- unique(claimsMonthly[policyID == policy, .(exposure = sum(exposure)),
                                        by = c("productID", "Year", "product", "QualityCheck", "dependant")][order(Year)])

   productID       Year product QualityCheck dependant exposure
1:           1      2016  ELI18            0  EMPLOYEE 17.041096
2:           1      2016  ELI18            0    SPOUSE 40.484932
3:           1      2016  ELI18            0     CHILD 5.164384

and then i do the following:

NOdependants <- dcast(NOdependants,  productID + Year ~ dependant, value.var = "exposure", fill = 0, drop = FALSE, fun.aggregate = sum)
setnames(NOdependants, c("CHILD", "EMPLOYEE", "SPOUSE"), c("childno", "employeeno", "spouseno"), skip_absent=TRUE)

> NOdependants
   productRank startYear  childno employeeno spouseno
1:           1      2016 5.164384   17.041096 41.484932

This is all good so far. The problem is when a product doesnt have any data on one of the factor in dependant. Lets say there are no children:

 NOdependants <- unique(claimsMonthly[policyID == policy, .(exposure = sum(exposure)),
                                        by = c("productID", "Year", "product", "QualityCheck", "dependant")][order(Year)])

   productID       Year product QualityCheck dependant exposure
1:           1      2016  ELI18            0  EMPLOYEE 17.041096
2:           1      2016  ELI18            0    SPOUSE 40.484932

Then my dcast does the following:

> NOdependants
   productRank startYear  employeeno spouseno
1:           1      2016  17.041096 41.484932

and this is a problem for me, I need to have all three columns. SO what i need is to artifically create an additional columns in case a factor level is without data (like child in here), so i would get something this:

> NOdependants
   productRank startYear  childno employeeno spouseno
1:           1      2016       0   17.041096 41.484932

for now i have create a workround, where i first create an empty data.table and then use rbindlist with fill=0 to merge theses, but there must be some easier solution.

Any ideas?

Note: i am working with a lot of data and this operation is part of a loop which will be repeated around 80 times or so, so ideally something efficient is possible.

SIMPLIFIED EXAMPLE WITH DATA:

#
> claimsMonthly <- data.table(productID = c(rep(1,6), rep(2,3), rep(3,2)),
+                      Year = c(rep(2015,9), 2016, 2016),
+                      product = c(rep("ELI18",6), rep("JCI22",3), rep("ZDP01",2)),
+                      dependant = c(rep(c("EMPLOYEE", "SPOUSE", "CHILD"), 3),"EMPLOYEE", "SPOUSE"),
+                      QualityCheck = c(rep(0,11)),
+                      exposure = c(abs(rnorm(11))))
> 
> productIDs <- unique(claimsMonthly$productID)
> for(prod in productIDs){
+  
+   NOdependants <- unique(claimsMonthly[ productID == prod, .(exposure = sum(exposure)),
+                                         by = c("productID", "Year", "product", "QualityCheck", "dependant")][order(Year)])
+   
+   NOdependants <- dcast(NOdependants,  productID + Year ~ dependant, value.var = "exposure", fill = 0, drop = FALSE, fun.aggregate = sum)
+   setnames(NOdependants, c("CHILD", "EMPLOYEE", "SPOUSE"), c("childno", "employeeno", "spouseno"), skip_absent=TRUE) 
+ 
+   NOdependants[order(childno)]
+     
+ }
Error in .checkTypos(e, names_x) : 
  Object 'childno' not found amongst productID, Year, employeeno, spouseno

Solution

  • Your use of 'unique' outside of data.table brackets may perplex data.table here. Please See: https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/duplicated

    I am wondering if your code could be simpler and achieve your result just as well. Some of the beauty of rdata.table is its ability to eliminate the need for loops and control structures. Using your sample data for 'claimsMonthly':

    claimsMonthly[, .(exposure = sum(exposure)),
    .(productID,Year,product,QualityCheck,dependant)][
    ,dcast(.SD, productID + Year ~ dependant,
    value.var = "exposure", drop = FALSE, fun.aggregate = sum)][
             CHILD == 0 &
             EMPLOYEE == 0 &
             SPOUSE == 0,.(productID,Year,CHILD,EMPLOYEE,SPOUSE)]
    
           productID Year CHILD EMPLOYEE SPOUSE
        1:         1 2016     0        0      0
        2:         2 2016     0        0      0
        3:         3 2015     0        0      0