Search code examples
rbioinformaticsfactorialanovabiometrics

Factorial - Four Way ANOVA - How to find a statistically effective combination


I have to analyze an experiment data set to find a most effective combination of a molecular biology reaction.

The experiment has four factors: Temperature, RPM, Time, Catalytic activity. And I am measuring the Efficiency of a reaction (EE). How can I find an effective combination of four factors for the highest efficiency(EE)?

  • No repeated measurements. All data are independent experimental data

As I understood - EE is parametric data, factors are categorical data (Fixed combinations). Do I have to go for a Fourway ANOVA?

if so is this model correct for the analysis

library(lsmeans)
    
lm(EE ~ Temperature + RPM + Time+ Catalytic + 
Temperature:RPM +
Temperature:Time + 
Temperature:Catalytic + 
RPM:Time+
RPM+Catalytic+
Time+Catalytic+
Temperature:RPM:Time + 
Temperature:RPM:Catalytic+
Temperature:Time:Catalytic+
RPM:Time:Catalytic+
Temperature:RPM:Time:Catalytic, "data")

And, then how can I get the significant values for each pairwise comparison?

Here is the sample data set for an example.

> dput(df)
structure(list(TEMPERATURE = c(40, 40, 40, 40, 40, 40, 40, 40, 
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 45, 45, 45, 45, 45, 
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 
45, 45, 45), RPM = c(150, 150, 150, 150, 150, 150, 150, 150, 
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 200, 
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 
200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150, 150, 
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 
200, 200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150, 
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 
150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 
200, 200, 200, 200, 200, 200, 200, 200), TIME = c(24, 24, 24, 
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96, 
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 
96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 
72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48, 
48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24, 
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96, 
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 
96, 96, 96, 96, 96), CAT = c(4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12), 
    EE = c(50, 53, 54, 57, 59, 53, 56, 59, 61, 64, 57, 58, 60, 
    62, 63, 56, 54, 52, 55, 55, 44, 48, 50, 50, 54, 49, 52, 56, 
    57, 56, 52, 56, 57, 58, 66, 46, 48, 48, 52, 49, 53, 57, 59, 
    62, 64, 54, 58, 60, 64, 66, 55, 59, 61, 63, 65, 54, 59, 64, 
    65, 67, 49, 51, 53, 54, 59, 50, 54, 63, 64, 64, 52, 56, 56, 
    59, 57, 52, 55, 58, 60, 63, 52, 56, 58, 61, 63, 54, 55, 58, 
    63, 63, 56, 58, 62, 62, 65, 57, 59, 62, 63, 66, 42, 42, 51, 
    54, 56, 46, 50, 52, 56, 58, 48, 51, 54, 55, 57, 48, 53, 56, 
    57, 61)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -120L), spec = structure(list(cols = list(
    TEMPERATURE = structure(list(), class = c("collector_double", 
    "collector")), RPM = structure(list(), class = c("collector_double", 
    "collector")), TIME = structure(list(), class = c("collector_double", 
    "collector")), CAT = structure(list(), class = c("collector_double", 
    "collector")), EE = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))

Solution

  • Try something like this:

    library(rsm)
    mod = rsm(EE ~ SO(Temperature, RPM, Time, Catalytic), data = data)
    summary(mod)
    

    This will fit a second-order surface (model equation includes all predictors, two-way interactions, and squares). The summary shows a stationary point and related statistics. If all the eigenvalues are negative, then it is a peak. Otherwise you have some kind of saddle point.

    This model differs from the one in the OP in that it has no three-way or four-way interactions, but includes the squares of the predictors, which are really essential for fitting a second-order response surface.

    Lots more details

    I had to revise this a bit to account for the fact that R is case-sensitive!

    > mod = rsm(EE ~ SO(TEMPERATURE, RPM, TIME, CAT), data = df)
    

    There is an issue here in that RPM has only two values, so we can't estimate a pure quadratic effect. So there is one coefficient of NA and that messes up the computation of the stationary point. However, we can still plot the fitted surface (despite a few warning messages)

    > par(mfrow = c(2,3))
    > contour(mod, ~TEMPERATURE+RPM+TIME+CAT)
    

    enter image description here

    It looks like we are best off with large CAT and lower RPM (see that plot), so look again:

    > par(mfrow=c(1,1))
    > contour(mod, ~ TEMPERATURE + TIME, at = list(CAT = 12, RPM = 150))
    

    enter image description here

    So visually, we seem to get the best response at around temperature 43.5, time 65, catalyst 12, and rpm 150.

    If you insist on modeling these as factors, it can be done, but you need to convert all the predictors to factors. This is a common error; you can have a designed experiment with only a few distinct values of a quantitative variable, but R does not read your mind and assume it's a factor; you have to convert it to one. In the following I have opted to fit a model with up to 2-way interactions.

    > facmod = lm(EE ~ (factor(TEMPERATURE) + factor(TIME) + factor(RPM) + factor(CAT))^2, data = df)
    > library(emmeans)
    > emmip(facmod, TIME ~ TEMPERATURE | CAT*RPM)
    

    enter image description here

    The highest fitted response is at catalyst 12, RPM 150, temperature 42.5 or larger, and time 96. It is clear that 150 RPM is better (left vs. right comoparisons) and the high CAT is better (comparing panels vertically). These are different models and somewhat different results. I like the rsm approach better as it is more systematic.