I'm trying to do a brute-force head-to-head comparison of several statistical tests on the same simulated datasets. I'm going to generate several thousand 'control' and several thousand 'experimental' populations and run the same tests on each set. The wrapper that calls the tests is going to be called thousands of times.
First, some details on the setup; my questions follow below.
I already have the simulated populations, and will use the appropriate apply function to pass the control and corresponding experimental observations to the wrapper.
The wrapper will take no arguments other than the control and experimental observations (call them xx and yy). Everything else will be hardcoded within the wrapper, to minimize the overhead of flow-control logic and of copying data between environments.
Each function to be called will go on a separate line, in a consistent format, in order of dependency (in the sense that, for example, cox.zph depends on there already being a coxph object, so coxph() will be called before cox.zph()). Each call will be wrapped in try(), and if a function fails, the functions that depend on its output will first test whether the returned object has try-error as its first class and, if it does, substitute some kind of placeholder value.
The block of function calls will be followed by a long c() statement, with each item extracted from its respective fit object on a separate line. Here too, if the source object turns out to be a try-error or a placeholder, an NA goes into that output slot.
This way, the whole run isn't aborted if some of the functions fail, and the output from each simulation is a numeric vector of the same length, suitable for capturing to a matrix.
Depending on the goals of a given set of simulations, I can comment out or insert additional tests and results as needed.
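A minimal sketch of the wrapper pattern described above, with t.test() and wilcox.test() standing in for the real battery of tests (the actual calls, and the statistics extracted, are placeholders):

```r
# Hypothetical wrapper: xx = control observations, yy = experimental ones.
# Everything else is hardcoded; each call is wrapped in try() so that one
# failure doesn't abort the whole run.
simwrapper <- function(xx, yy) {
  tt <- try(t.test(xx, yy), silent = TRUE)
  wt <- try(wilcox.test(xx, yy), silent = TRUE)

  # Long c() statement: one extracted item per line, NA if the source
  # object is a try-error.
  c(
    t.p = if (inherits(tt, "try-error")) NA else tt$p.value,
    w.p = if (inherits(wt, "try-error")) NA else wt$p.value
  )
}
```

Running this over all simulations with, e.g., mapply(simwrapper, controls, experiments) then yields a matrix with one column per simulation, as described.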
Now, my questions:

1. Given that I'm already using compilePKGS(T) and enableJIT(3) (from the built-in compiler library), is there anything further to be gained by manually running compile() or cmpfun() on my wrapper function and the interpreted functions it calls?

2. Is there an optimal enableJIT() value, or, if I don't care about startup time, is "the more, the better"?

3. I'd also like to be able to occasionally drop into a running simulation (call browser(), save out internal objects, etc.) without having to abort the whole run. The obvious approach is to have the wrapper check for a flag, such as the existence of a file, and source a debug script if it is set. But I imagine that pinging the file system that often will start to add up. Is there a consensus on the most efficient way to communicate a boolean value (i.e. source the debug script or don't) to a running R process (under Linux)?

Thanks.
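Edit: for concreteness, the compilation setup in question looks like this (the loop function is just a toy example):

```r
library(compiler)
compilePKGS(TRUE)  # byte-compile packages as they are loaded
enableJIT(3)       # most aggressive JIT level

# The question is whether explicit cmpfun() adds anything on top of this.
f  <- function(x) { s <- 0; for (i in x) s <- s + i; s }
fc <- cmpfun(f)    # explicitly byte-compiled copy of the same function
```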
This will likely only address part of your questions. I've had luck speeding up processes by avoiding the apply functions as well; apply() is not vectorized and actually takes quite a bit of time. I saw gains by using nested ifelse() statements instead.
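For instance, a row-wise maximum that might otherwise be written with apply() can be expressed with a vectorized ifelse() (an illustrative comparison, not your actual tests):

```r
m <- matrix(rnorm(2e4), ncol = 2)

# apply() version: one R-level function call per row
a <- apply(m, 1, function(r) if (r[1] > r[2]) r[1] else r[2])

# ifelse() version: a single vectorized pass over the columns
b <- ifelse(m[, 1] > m[, 2], m[, 1], m[, 2])

all.equal(a, b)  # TRUE
```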
Have you tried Rprof()? It was useful in my case for identifying the slow elements of my code. Not a solution per se, but a useful diagnostic.
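For example, bracketing a representative chunk of the work with Rprof() calls (the file name and workload here are placeholders):

```r
Rprof("sim.out")                        # start writing profiling samples
for (i in 1:50) x <- sort(rnorm(2e5))   # stand-in for the simulation work
Rprof(NULL)                             # stop profiling
head(summaryRprof("sim.out")$by.total)  # time spent, broken down by function
```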