I have to analyze an experiment data set to find a most effective combination of a molecular biology reaction.
The experiment has four factors: Temperature, RPM, Time, Catalytic activity. And I am measuring the Efficiency of a reaction (EE). How can I find an effective combination of four factors for the highest efficiency(EE)?
As I understood - EE is parametric data, factors are categorical data (Fixed combinations). Do I have to go for a Fourway ANOVA?
if so is this model correct for the analysis
library(lsmeans)
lm(EE ~ Temperature + RPM + Time+ Catalytic +
Temperature:RPM +
Temperature:Time +
Temperature:Catalytic +
RPM:Time+
RPM+Catalytic+
Time+Catalytic+
Temperature:RPM:Time +
Temperature:RPM:Catalytic+
Temperature:Time:Catalytic+
RPM:Time:Catalytic+
Temperature:RPM:Time:Catalytic, "data")
And, then how can I get the significant values for each pairwise comparison?
Here is the sample data set for an example.
> dput(df)
structure(list(TEMPERATURE = c(40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5,
42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 42.5, 45, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45,
45, 45, 45), RPM = c(150, 150, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 200,
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 200, 150, 150, 150, 150, 150, 150,
150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150, 150,
150, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 200, 200, 200, 200, 200), TIME = c(24, 24, 24,
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96,
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72,
96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72,
72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24, 24, 24, 48, 48,
48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96, 96, 24, 24, 24,
24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72, 96, 96, 96, 96,
96, 24, 24, 24, 24, 24, 48, 48, 48, 48, 48, 72, 72, 72, 72, 72,
96, 96, 96, 96, 96), CAT = c(4, 6, 8, 10, 12, 4, 6, 8, 10, 12,
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10,
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6,
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12,
4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10,
12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6,
8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12, 4, 6, 8, 10, 12),
EE = c(50, 53, 54, 57, 59, 53, 56, 59, 61, 64, 57, 58, 60,
62, 63, 56, 54, 52, 55, 55, 44, 48, 50, 50, 54, 49, 52, 56,
57, 56, 52, 56, 57, 58, 66, 46, 48, 48, 52, 49, 53, 57, 59,
62, 64, 54, 58, 60, 64, 66, 55, 59, 61, 63, 65, 54, 59, 64,
65, 67, 49, 51, 53, 54, 59, 50, 54, 63, 64, 64, 52, 56, 56,
59, 57, 52, 55, 58, 60, 63, 52, 56, 58, 61, 63, 54, 55, 58,
63, 63, 56, 58, 62, 62, 65, 57, 59, 62, 63, 66, 42, 42, 51,
54, 56, 46, 50, 52, 56, 58, 48, 51, 54, 55, 57, 48, 53, 56,
57, 61)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -120L), spec = structure(list(cols = list(
TEMPERATURE = structure(list(), class = c("collector_double",
"collector")), RPM = structure(list(), class = c("collector_double",
"collector")), TIME = structure(list(), class = c("collector_double",
"collector")), CAT = structure(list(), class = c("collector_double",
"collector")), EE = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Try something like this:
library(rsm)
mod = rsm(EE ~ SO(Temperature, RPM, Time, Catalytic), data = data)
summary(mod)
This will fit a second-order surface (model equation includes all predictors, two-way interactions, and squares). The summary shows a stationary point and related statistics. If all the eigenvalues are negative, then it is a peak. Otherwise you have some kind of saddle point.
This model differs from the one in the OP in that it has no three-way or four-way interactions, but includes the squares of the predictors, which are really essential for fitting a second-order response surface.
I had to revise this a bit to account for the fact that R is case-sensitive!
> mod = rsm(EE ~ SO(TEMPERATURE, RPM, TIME, CAT), data = df)
There is an issue here in that RPM has only two values, so we can't estimate a pure quadratic effect. So there is one coefficient of NA and that messes up the computation of the stationary point. However, we can still plot the fitted surface (despite a few warning messages)
> par(mfrow = c(2,3))
> contour(mod, ~TEMPERATURE+RPM+TIME+CAT)
It looks like we are best off with large CAT and lower RPM (see that plot), so look again:
> par(mfrow=c(1,1))
> contour(mod, ~ TEMPERATURE + TIME, at = list(CAT = 12, RPM = 150))
So visually, we seem to get the best response at around temperature 43.5, time 65, catalyst 12, and rpm 150.
If you insist on modeling these as factors, it can be done, but you need to convert all the predictors to factors. This is a common error; you can have a designed experiment with only a few distinct values of a quantitative variable, but R does not read your mind and assume it's a factor; you have to convert it to one. In the following I have opted to fit a model with up to 2-way interactions.
> facmod = lm(EE ~ (factor(TEMPERATURE) + factor(TIME) + factor(RPM) + factor(CAT))^2, data = df)
> library(emmeans)
> emmip(facmod, TIME ~ TEMPERATURE | CAT*RPM)
The highest fitted response is at catalyst 12, RPM 150, temperature 42.5 or larger, and time 96. It is clear that 150 RPM is better (left vs. right comoparisons) and the high CAT is better (comparing panels vertically). These are different models and somewhat different results. I like the rsm approach better as it is more systematic.