
When using `data.table`'s `DT[i, j, by]`, is it possible to set the column types beforehand?


I'm trying to calculate the correlation between two variables for multiple different groups (e.g. DT[, cor.test(var1, var2), group]). This works great whenever I use cor.test(var1, var2, method = 'pearson') but throws an error when I use cor.test(var1, var2, method = 'spearman').

library(data.table)
DT <- as.data.table(iris)

# works perfectly 
DT[,cor.test(Sepal.Length,Sepal.Width, method = 'pearson'), Species]
#       Species statistic parameter      p.value  estimate null.value
# 1:     setosa  7.680738        48 6.709843e-10 0.7425467          0
# 2:     setosa  7.680738        48 6.709843e-10 0.7425467          0
# 3: versicolor  4.283887        48 8.771860e-05 0.5259107          0
# 4: versicolor  4.283887        48 8.771860e-05 0.5259107          0
# 5:  virginica  3.561892        48 8.434625e-04 0.4572278          0
# 6:  virginica  3.561892        48 8.434625e-04 0.4572278          0
#    alternative                               method
# 1:   two.sided Pearson's product-moment correlation
# 2:   two.sided Pearson's product-moment correlation
# 3:   two.sided Pearson's product-moment correlation
# 4:   two.sided Pearson's product-moment correlation
# 5:   two.sided Pearson's product-moment correlation
# 6:   two.sided Pearson's product-moment correlation
#                       data.name  conf.int
# 1: Sepal.Length and Sepal.Width 0.5851391
# 2: Sepal.Length and Sepal.Width 0.8460314
# 3: Sepal.Length and Sepal.Width 0.2900175
# 4: Sepal.Length and Sepal.Width 0.7015599
# 5: Sepal.Length and Sepal.Width 0.2049657
# 6: Sepal.Length and Sepal.Width 0.6525292

# error
DT[,cor.test(Sepal.Length,Sepal.Width, method = 'spearman'), Species]
# Error in `[.data.table`(DT, , cor.test(Sepal.Length, Sepal.Width, method = "spearman"), : 
# Column 2 of j's result for the first group is NULL. We rely on the column types of the first 
# result to decide the type expected for the remaining groups (and require consistency). NULL 
# columns are acceptable for later groups (and those are replaced with NA of appropriate type 
# and recycled) but not for the first. Please use a typed empty vector instead, such as 
# integer() or numeric().

Question:

I know there are workarounds for this specific example, but is it possible to tell data.table beforehand what the column types are going to be, for any case using DT[i, j, by = 'something']?


Solution

  • If you want to keep all columns rather than remove the ones that are NULL, you can set the class of the problem column manually (in this case, the column causing the error is "parameter"). This is preferable to removing the NULLs when the column contains values for some groups but not others.

    DT[, {
      res <- cor.test(Sepal.Length, Sepal.Width, method = 'spearman')
      class(res$parameter) <- 'integer'
      res
      }, Species]
    
    #      Species statistic parameter      p.value  estimate null.value alternative                          method                    data.name
    #1:     setosa  5095.097        NA 2.316710e-10 0.7553375          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
    #2: versicolor 10045.855        NA 1.183863e-04 0.5176060          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
    #3:  virginica 11942.793        NA 2.010675e-03 0.4265165          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
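
A related way to fix the column types up front is to build the result list in j yourself and coerce each element explicitly, substituting a typed NA wherever the test returns NULL. The following is only a minimal sketch of that idea, not a built-in data.table feature: the column names `rho`, `p.value` and `parameter` are chosen here for illustration, and `exact = FALSE` is added as an assumption to avoid the ties warning that the Spearman test otherwise raises on this data.

```r
library(data.table)
DT <- as.data.table(iris)

res <- DT[, {
  # exact = FALSE is an assumption made here to avoid the ties warning
  ct <- cor.test(Sepal.Length, Sepal.Width, method = "spearman", exact = FALSE)
  list(
    rho       = as.numeric(ct$estimate),   # always a double
    p.value   = as.numeric(ct$p.value),    # always a double
    # NULL for Spearman, so fall back to a typed NA of the intended type
    parameter = if (is.null(ct$parameter)) NA_integer_ else as.integer(ct$parameter)
  )
}, by = Species]

res
```

Because every group returns the same named elements with the same types, the first group can no longer hand data.table a NULL column, whichever `method` is used.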