Here is the graphical relationship between the two variables:
I have tried: a log-transform, multiple starting functions, different nls parameters (the max of y, and different values for b in the exponential function), spline regression, and splitting the data into "higher" and "lower" subsets to model each separately. The log-transform model performs OK, but loses some power in the middle cases, which, unfortunately for our business rules, is where the bulk of our data will be arriving.
Here's the output of dput() to get you started; please let me know below if you have questions!
dd <- structure(list(seed_prev_num = c(2671089.52136, 1723563.978132,
1519478.513963, 10359757.487541, 435039.876186, 400860.993997,
4824556.66506, 7345496.18374, 7167489.404247, 1541803.448572,
802314.685373, 2589889.980437, 1578945.817656, 593092.510586,
9524646.88053, 1517305.29024, 12332649.496439, 1225895.745565,
1723563.978132, 4029354.348235, 486209.416573, 2473918.859946,
2200685.368227, 1687211.87222, 34608587.788775, 127034.804899,
2174606.683551, 694443.762395, 1297414.562631, 269677.307445),
xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588,
6.017711172, 3.124199113, 3.369492219, 2.702313072, 2.587612668,
3.000256279, 2.360256095, 4.987947212, 7.002252252, 16.762824783,
2.35521676, 10.671484375, 3.007449177, 7.73650282, 4.812929849,
2.976955136, 17.230394149, 4.272320716, 5.298590538, 6.414285714,
2.321626944, 9.362363919, 3.303806668, 10.004836415, 8.26450434,
5.726739927)), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
These all seem to work fine:
m1 <- lm(log(xaq) ~ seed_prev_num, dd)
This assumes that the deviation is log-normal, i.e. that log(xaq)
is Normally distributed with a constant variance. I don't understand what you mean by "loses some power in the middle cases" ...
This fits the model y ~ Normal(exp(a+b*x), sigma)
m2 <- glm(xaq ~ seed_prev_num, family = gaussian(link = "log"), dd)
This fits the same model.
m3 <- nls(xaq ~ a*exp(b*seed_prev_num),
data = dd,
start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))
These are both TRUE:
all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)
This trick is probably all you need. It also works if you use the starting values from the log-linear fit (start = list(a = exp(coef(m1)[1]), b = coef(m1)[2])).
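Putting those pieces together, here is a self-contained version of the GLM-to-nls handoff (it rebuilds the data from the dput() output above, assigned to dd as in the snippets):

```r
## Rebuild the question's data; named dd to match the snippets above.
dd <- data.frame(
  seed_prev_num = c(2671089.52136, 1723563.978132, 1519478.513963,
    10359757.487541, 435039.876186, 400860.993997, 4824556.66506,
    7345496.18374, 7167489.404247, 1541803.448572, 802314.685373,
    2589889.980437, 1578945.817656, 593092.510586, 9524646.88053,
    1517305.29024, 12332649.496439, 1225895.745565, 1723563.978132,
    4029354.348235, 486209.416573, 2473918.859946, 2200685.368227,
    1687211.87222, 34608587.788775, 127034.804899, 2174606.683551,
    694443.762395, 1297414.562631, 269677.307445),
  xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 6.017711172,
    3.124199113, 3.369492219, 2.702313072, 2.587612668, 3.000256279,
    2.360256095, 4.987947212, 7.002252252, 16.762824783, 2.35521676,
    10.671484375, 3.007449177, 7.73650282, 4.812929849, 2.976955136,
    17.230394149, 4.272320716, 5.298590538, 6.414285714, 2.321626944,
    9.362363919, 3.303806668, 10.004836415, 8.26450434, 5.726739927))

## Log-link Gaussian GLM: fits y ~ Normal(exp(a + b*x), sigma)
m2 <- glm(xaq ~ seed_prev_num, family = gaussian(link = "log"), data = dd)

## Hand the GLM coefficients to nls as starting values
m3 <- nls(xaq ~ a * exp(b * seed_prev_num),
          data = dd,
          start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))

## Same mean model and error model, so the estimates agree
all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)
```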
I suspect that the primary problem is the scaling of your x variable ...
m4 <- nls(xaq ~ a*exp(b*seed_prev_num/1e7),
data = dd,
start = list(a = 1, b = 1))
all.equal(coef(m4)[["b"]]/1e7, coef(m3)[["b"]],
tolerance = 1e-4)
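Spelled out, the rescaling fix looks like this (same dd as above, rebuilt here so the chunk runs on its own). Dividing x by 1e7 puts it on roughly unit scale, so nls converges even from the generic start values a = 1, b = 1:

```r
## Rebuild the question's data (named dd, as above)
dd <- data.frame(
  seed_prev_num = c(2671089.52136, 1723563.978132, 1519478.513963,
    10359757.487541, 435039.876186, 400860.993997, 4824556.66506,
    7345496.18374, 7167489.404247, 1541803.448572, 802314.685373,
    2589889.980437, 1578945.817656, 593092.510586, 9524646.88053,
    1517305.29024, 12332649.496439, 1225895.745565, 1723563.978132,
    4029354.348235, 486209.416573, 2473918.859946, 2200685.368227,
    1687211.87222, 34608587.788775, 127034.804899, 2174606.683551,
    694443.762395, 1297414.562631, 269677.307445),
  xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 6.017711172,
    3.124199113, 3.369492219, 2.702313072, 2.587612668, 3.000256279,
    2.360256095, 4.987947212, 7.002252252, 16.762824783, 2.35521676,
    10.671484375, 3.007449177, 7.73650282, 4.812929849, 2.976955136,
    17.230394149, 4.272320716, 5.298590538, 6.414285714, 2.321626944,
    9.362363919, 3.303806668, 10.004836415, 8.26450434, 5.726739927))

## Exponential fit on rescaled x: converges from trivial start values
m4 <- nls(xaq ~ a * exp(b * seed_prev_num / 1e7),
          data = dd,
          start = list(a = 1, b = 1))

## Undo the rescaling to recover the slope on the original x scale
coef(m4)[["b"]] / 1e7
```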
The fits don't get much better: your management may have to live with the fact that these are noisy data.
I also tried mgcv::gam with a log-link Gaussian GLM and a smooth function of seed_prev_num, but it collapses back to an exponential fit. In any case, that model would violate your "easy to export coefficients and compute in Excel" rubric.
m6 <- nls(xaq ~ a*exp(b*(seed_prev_num/1e7)^c),
data = dd,
start = list(a = 9, b = -2, c = 1.1)
)
does slightly better (it increases the squared correlation between predictions and observed values, i.e. one form of R^2, to 0.34) but is not a better model according to AIC ...
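For completeness, here is a runnable sketch of that comparison. It rebuilds dd once more, and uses the rescaled exponential (the m4 form, which is the same model as m3 up to the 1e7 scaling of b) as the two-parameter baseline:

```r
## Rebuild the question's data (named dd, as above)
dd <- data.frame(
  seed_prev_num = c(2671089.52136, 1723563.978132, 1519478.513963,
    10359757.487541, 435039.876186, 400860.993997, 4824556.66506,
    7345496.18374, 7167489.404247, 1541803.448572, 802314.685373,
    2589889.980437, 1578945.817656, 593092.510586, 9524646.88053,
    1517305.29024, 12332649.496439, 1225895.745565, 1723563.978132,
    4029354.348235, 486209.416573, 2473918.859946, 2200685.368227,
    1687211.87222, 34608587.788775, 127034.804899, 2174606.683551,
    694443.762395, 1297414.562631, 269677.307445),
  xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 6.017711172,
    3.124199113, 3.369492219, 2.702313072, 2.587612668, 3.000256279,
    2.360256095, 4.987947212, 7.002252252, 16.762824783, 2.35521676,
    10.671484375, 3.007449177, 7.73650282, 4.812929849, 2.976955136,
    17.230394149, 4.272320716, 5.298590538, 6.414285714, 2.321626944,
    9.362363919, 3.303806668, 10.004836415, 8.26450434, 5.726739927))

## Two-parameter exponential baseline (rescaled x)
m4 <- nls(xaq ~ a * exp(b * seed_prev_num / 1e7),
          data = dd, start = list(a = 1, b = 1))

## Three-parameter power-exponent variant
m6 <- nls(xaq ~ a * exp(b * (seed_prev_num / 1e7)^c),
          data = dd, start = list(a = 9, b = -2, c = 1.1))

## Squared correlation between fitted and observed: one form of R^2
r2 <- function(m) cor(fitted(m), dd$xaq)^2
c(exp = r2(m4), pow = r2(m6))

## AIC comparison: lower is better; the extra parameter c is penalized
AIC(m4, m6)
```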