Search code examples
rdata.tablemixture-model

how to apply complex model outputs neatly to data.table by a factor


I am using the normalmixEM function (algorithim) in R over a data.table object.

It is a simple procedure to run it over the full table. This outputs a mixEM list object, in which the $posterior item is the most interest. I can map that back to the data in an admittedly slightly awkward way using cbind like so:

library(data.table)
library(ggplot2)
library(mixtools)
set.seed(100)

faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c('a', 'b'), .N)]
# Make some data

qplot(data = faithfulDT, x = eruptions, fill = factor) + facet_grid(factor ~.)
# graph the distribution

faithfulMix <- faithfulDT[, normalmixEM(eruptions)]
cbind(faithfulDT, data.table(faithfulMix$posterior)) # to join posterior probabilities to values. I'm ASSUMING this is the best way to do it?
plot(faithfulMix, whichplots = 2)
# model and graph without factorAB

However, I am struggling to include the by argument of data.table usefully in this workflow. My objective is to run a normalmixEM function by the factorAB. Effectively I want to run two, separated, isolated models on subsets of the data, and then have the posteriors to hand per value in two columns at the end of it in a single data.table as per it's underlying 'split-apply-combine' strategy.

library(data.table)
library(ggplot2)
library(mixtools)
set.seed(100)

faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c('a', 'b'), .N)]
# Make some data

qplot(data = faithfulDT, x = eruptions, fill = factor) + facet_grid(factor ~.)
# graph the distribution

faithfulMix <- faithfulDT[, normalmixEM(eruptions)]
cbind(faithfulDT, data.table(faithfulMix$posterior)) # to join posterior probabilities to values. I'm ASSUMING this is the best way to do it?
plot(faithfulMix, whichplots = 2)
# model and graph without factorAB

faithfulMixAB <- faithfulDT[, normalmixEM(eruptions), by = factorAB]
# model and graph with factorAB - attempt by

faithfulMixAB <- faithfulDT[, normalmixEM(.SD$eruptions), by = factorAB]
# model and graph with factorAB - attempt by and .SD

faithfulMixAB <- faithfulDT[, normalmixEM(.SD), by = factorAB, .SDcols = "eruptions"]
# model and graph with factorAB - attempt by and .SD and .SDcols

faithfulMixAB <- faithfulDT[, lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply
faithfulMixAB
# partial success?

faithfulMixABAssign <- faithfulDT[, mixMDL := lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply and try to assign
faithfulMixABAssign
# even more partial success?

Evidently here I have successfully managed to mangle out a solution which appears to have the right numbers, but in largely arbitrary locations.

What am I missing in terms of this workflow which will neaten the output in the case of including the factorAB split? Evidentially I need to find a surrogate for the cbind portion of the work, however my current output is a mess to begin with. Can I improve the output at faithfulmixAB to facilitate this? Potentially skip this entierly and assign the posterior values straight from the function running within data.table?


EDIT

After help from @eddi and a friend IRL what I'm at right now is:

faithfulDT[, mixPostFull.1 := normalmixEM(eruptions)$posterior[,1]]
faithfulDT[, mixPostFull.2 := normalmixEM(eruptions)$posterior[,2]]

which represents the two posterior columns from running the model without splitting the factor. And:

faithfulDT[, mixPostAB.1 := normalmixEM(eruptions)$posterior[,1], by = factorAB]
faithfulDT[, mixPostAB.2 := normalmixEM(eruptions)$posterior[,2], by = factorAB]

Which has both columns, but does split by factor, which is actually what I'm trying to do.

I think that something like both of these are needed as the posterior object is actually 2 vectors, one indicating probability that record is in 1 and group, and the other indicating it is in the second.

Eddi, your current answer has 2 columns, but I don't think that those correspond as laid out above. If anything, the values are slightly different:

eruptions waiting factorAB mixPostFull.1 mixPostFull.2  mixPostAB.1  mixPostAB.2
  1:     3.600      79        a  5.376906e-10  1.000000e+00 1.581467e-11 1.000000e+00
  2:     1.800      54        b  9.999998e-01  1.723648e-07 1.000000e+00 2.112761e-09
  3:     3.333      74        a  1.755506e-06  9.999982e-01 1.405098e-07 9.999999e-01
  4:     2.283      62        b  9.999406e-01  5.939085e-05 9.999974e-01 2.599843e-06
  5:     4.533      85        a  2.215050e-25  1.000000e+00 3.658846e-29 1.000000e+00
 ---                                                                                 
268:     4.117      81        b  6.337730e-18  1.000000e+00 9.658721e-10 1.000000e+00
269:     2.150      46        a  9.999912e-01  8.828998e-06 9.999828e-01 1.724380e-05
270:     4.417      90        b  3.320219e-23  1.000000e+00 1.461450e-12 1.000000e+00
271:     1.817      46        a  9.999998e-01  2.012672e-07 9.999995e-01 4.981589e-07
272:     4.467      74        b  3.912776e-24  1.000000e+00 4.818983e-13 1.000000e+00

What I need really is a method to not have to run the model twice over. I'm pretty sure I can fudge ':=' in there somewhere, but I don't have the time right now. Will return to it in a bit.

Some time later So I overlooked in the previous that I can't just re-run the model to get the second column because, apart from obviously being pretty inefficient, because of the nature of the algorithm unless I set seed, as each run has a different start point, so it will settle on a slightly different answer from one run to the next.


Solution

  • I think you're looking for something like this:

    faithfulDT[, {
                   result = as.vector(normalmixEM(eruptions)$posterior);
                   faithfulDT[, paste0('result.', factorAB) := result];
                   NULL
                 }
               , by = factorAB]
    faithfulDT
    #     eruptions waiting factorAB     result.a     result.b
    #  1:     3.600      79        a 1.581719e-11 1.000000e+00
    #  2:     1.800      54        b 1.405263e-07 9.999974e-01
    #  3:     3.333      74        a 3.660230e-29 9.531090e-01
    #  4:     2.283      62        b 5.986926e-33 3.630698e-05
    #  5:     4.533      85        a 9.999983e-01 6.384911e-12
    # ---                                                     
    #268:     4.117      81        b 6.545978e-07 1.000000e+00
    #269:     2.150      46        a 2.342451e-06 1.562445e-06
    #270:     4.417      90        b 1.000000e+00 1.000000e+00
    #271:     1.817      46        a 1.724380e-05 1.000000e+00
    #272:     4.467      74        b 4.981589e-07 1.000000e+00
    

    After the discussions in the comments and in the OP, turns out the desired answer is:

    faithfulDT[, c('mixAB.1', 'mixAB.2') := as.data.table(normalmixEM(eruptions)$posterior)
               , by = factorAB]