I am using the normalmixEM
function (algorithim) in R over a data.table
object.
It is a simple procedure to run it over the full table. This outputs a mixEM
list object, in which the $posterior item is the most interest. I can map that back to the data in an admittedly slightly awkward way using cbind like so:
library(data.table)
library(ggplot2)
library(mixtools)
set.seed(100)
faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c('a', 'b'), .N)]
# Make some data
qplot(data = faithfulDT, x = eruptions, fill = factor) + facet_grid(factor ~.)
# graph the distribution
faithfulMix <- faithfulDT[, normalmixEM(eruptions)]
cbind(faithfulDT, data.table(faithfulMix$posterior)) # to join posterior probabilities to values. I'm ASSUMING this is the best way to do it?
plot(faithfulMix, whichplots = 2)
# model and graph without factorAB
However, I am struggling to include the by
argument of data.table usefully in this workflow. My objective is to run a normalmixEM
function by
the factorAB
. Effectively I want to run two, separated, isolated models on subsets of the data, and then have the posteriors to hand per value in two columns at the end of it in a single data.table
as per it's underlying 'split-apply-combine' strategy.
library(data.table)
library(ggplot2)
library(mixtools)
set.seed(100)
faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c('a', 'b'), .N)]
# Make some data
qplot(data = faithfulDT, x = eruptions, fill = factor) + facet_grid(factor ~.)
# graph the distribution
faithfulMix <- faithfulDT[, normalmixEM(eruptions)]
cbind(faithfulDT, data.table(faithfulMix$posterior)) # to join posterior probabilities to values. I'm ASSUMING this is the best way to do it?
plot(faithfulMix, whichplots = 2)
# model and graph without factorAB
faithfulMixAB <- faithfulDT[, normalmixEM(eruptions), by = factorAB]
# model and graph with factorAB - attempt by
faithfulMixAB <- faithfulDT[, normalmixEM(.SD$eruptions), by = factorAB]
# model and graph with factorAB - attempt by and .SD
faithfulMixAB <- faithfulDT[, normalmixEM(.SD), by = factorAB, .SDcols = "eruptions"]
# model and graph with factorAB - attempt by and .SD and .SDcols
faithfulMixAB <- faithfulDT[, lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply
faithfulMixAB
# partial success?
faithfulMixABAssign <- faithfulDT[, mixMDL := lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply and try to assign
faithfulMixABAssign
# even more partial success?
Evidently here I have successfully managed to mangle out a solution which appears to have the right numbers, but in largely arbitrary locations.
What am I missing in terms of this workflow which will neaten the output in the case of including the factorAB
split? Evidentially I need to find a surrogate for the cbind
portion of the work, however my current output is a mess to begin with. Can I improve the output at faithfulmixAB to facilitate this? Potentially skip this entierly and assign the posterior values straight from the function running within data.table?
EDIT
After help from @eddi and a friend IRL what I'm at right now is:
faithfulDT[, mixPostFull.1 := normalmixEM(eruptions)$posterior[,1]]
faithfulDT[, mixPostFull.2 := normalmixEM(eruptions)$posterior[,2]]
which represents the two posterior columns from running the model without splitting the factor. And:
faithfulDT[, mixPostAB.1 := normalmixEM(eruptions)$posterior[,1], by = factorAB]
faithfulDT[, mixPostAB.2 := normalmixEM(eruptions)$posterior[,2], by = factorAB]
Which has both columns, but does split by factor, which is actually what I'm trying to do.
I think that something like both of these are needed as the posterior object is actually 2 vectors, one indicating probability that record is in 1 and group, and the other indicating it is in the second.
Eddi, your current answer has 2 columns, but I don't think that those correspond as laid out above. If anything, the values are slightly different:
eruptions waiting factorAB mixPostFull.1 mixPostFull.2 mixPostAB.1 mixPostAB.2
1: 3.600 79 a 5.376906e-10 1.000000e+00 1.581467e-11 1.000000e+00
2: 1.800 54 b 9.999998e-01 1.723648e-07 1.000000e+00 2.112761e-09
3: 3.333 74 a 1.755506e-06 9.999982e-01 1.405098e-07 9.999999e-01
4: 2.283 62 b 9.999406e-01 5.939085e-05 9.999974e-01 2.599843e-06
5: 4.533 85 a 2.215050e-25 1.000000e+00 3.658846e-29 1.000000e+00
---
268: 4.117 81 b 6.337730e-18 1.000000e+00 9.658721e-10 1.000000e+00
269: 2.150 46 a 9.999912e-01 8.828998e-06 9.999828e-01 1.724380e-05
270: 4.417 90 b 3.320219e-23 1.000000e+00 1.461450e-12 1.000000e+00
271: 1.817 46 a 9.999998e-01 2.012672e-07 9.999995e-01 4.981589e-07
272: 4.467 74 b 3.912776e-24 1.000000e+00 4.818983e-13 1.000000e+00
What I need really is a method to not have to run the model twice over. I'm pretty sure I can fudge ':=' in there somewhere, but I don't have the time right now. Will return to it in a bit.
Some time later So I overlooked in the previous that I can't just re-run the model to get the second column because, apart from obviously being pretty inefficient, because of the nature of the algorithm unless I set seed, as each run has a different start point, so it will settle on a slightly different answer from one run to the next.
I think you're looking for something like this:
faithfulDT[, {
result = as.vector(normalmixEM(eruptions)$posterior);
faithfulDT[, paste0('result.', factorAB) := result];
NULL
}
, by = factorAB]
faithfulDT
# eruptions waiting factorAB result.a result.b
# 1: 3.600 79 a 1.581719e-11 1.000000e+00
# 2: 1.800 54 b 1.405263e-07 9.999974e-01
# 3: 3.333 74 a 3.660230e-29 9.531090e-01
# 4: 2.283 62 b 5.986926e-33 3.630698e-05
# 5: 4.533 85 a 9.999983e-01 6.384911e-12
# ---
#268: 4.117 81 b 6.545978e-07 1.000000e+00
#269: 2.150 46 a 2.342451e-06 1.562445e-06
#270: 4.417 90 b 1.000000e+00 1.000000e+00
#271: 1.817 46 a 1.724380e-05 1.000000e+00
#272: 4.467 74 b 4.981589e-07 1.000000e+00
After the discussions in the comments and in the OP, turns out the desired answer is:
faithfulDT[, c('mixAB.1', 'mixAB.2') := as.data.table(normalmixEM(eruptions)$posterior)
, by = factorAB]