Here is the graphical relationship between the two variables:
I have tried: a log-transform, multiple starting functions, different nls parameters (the max of y, and different values for b in the exponential function), spline regression, and splitting the data into "higher" and "lower" subsets to model each separately. The log-transform model performs OK, but loses some power in the middle cases, which, unfortunately for our business rules, is where the bulk of our data will be arriving.
Here's the output of dput() to get you started; please let me know below if you have questions!
dd <- structure(list(seed_prev_num = c(2671089.52136, 1723563.978132,
1519478.513963, 10359757.487541, 435039.876186, 400860.993997,
4824556.66506, 7345496.18374, 7167489.404247, 1541803.448572,
802314.685373, 2589889.980437, 1578945.817656, 593092.510586,
9524646.88053, 1517305.29024, 12332649.496439, 1225895.745565,
1723563.978132, 4029354.348235, 486209.416573, 2473918.859946,
2200685.368227, 1687211.87222, 34608587.788775, 127034.804899,
2174606.683551, 694443.762395, 1297414.562631, 269677.307445),
xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588,
6.017711172, 3.124199113, 3.369492219, 2.702313072, 2.587612668,
3.000256279, 2.360256095, 4.987947212, 7.002252252, 16.762824783,
2.35521676, 10.671484375, 3.007449177, 7.73650282, 4.812929849,
2.976955136, 17.230394149, 4.272320716, 5.298590538, 6.414285714,
2.321626944, 9.362363919, 3.303806668, 10.004836415, 8.26450434,
5.726739927)), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
These all seem to work fine:
m1 <- lm(log(xaq) ~ seed_prev_num, dd)
This assumes that the deviation is log-normal, i.e. that log(xaq)
is Normally distributed with a constant variance. I don't understand what you mean by "loses some power in the middle cases" ...
This fits the model y ~ Normal(exp(a+b*x), sigma)
m2 <- glm(xaq ~ seed_prev_num, family = gaussian(link = "log"), dd)
This fits the same model.
m3 <- nls(xaq ~ a*exp(b*seed_prev_num),
data = dd,
start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))
These are both TRUE:
all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)
This trick is probably all you need. It also works if you use the starting values from the log-linear fit (start = list(a = exp(coef(m1)[1]), b = coef(m1)[2])).
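Putting those pieces together, here is a self-contained version of the GLM-to-nls handoff (it rebuilds the data from the dput() output above, assigned to dd as in the snippets):

```r
## Rebuild the question's data; named dd to match the snippets above.
dd <- data.frame(
  seed_prev_num = c(2671089.52136, 1723563.978132, 1519478.513963,
    10359757.487541, 435039.876186, 400860.993997, 4824556.66506,
    7345496.18374, 7167489.404247, 1541803.448572, 802314.685373,
    2589889.980437, 1578945.817656, 593092.510586, 9524646.88053,
    1517305.29024, 12332649.496439, 1225895.745565, 1723563.978132,
    4029354.348235, 486209.416573, 2473918.859946, 2200685.368227,
    1687211.87222, 34608587.788775, 127034.804899, 2174606.683551,
    694443.762395, 1297414.562631, 269677.307445),
  xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 6.017711172,
    3.124199113, 3.369492219, 2.702313072, 2.587612668, 3.000256279,
    2.360256095, 4.987947212, 7.002252252, 16.762824783, 2.35521676,
    10.671484375, 3.007449177, 7.73650282, 4.812929849, 2.976955136,
    17.230394149, 4.272320716, 5.298590538, 6.414285714, 2.321626944,
    9.362363919, 3.303806668, 10.004836415, 8.26450434, 5.726739927))

## Log-link Gaussian GLM: fits y ~ Normal(exp(a + b*x), sigma)
m2 <- glm(xaq ~ seed_prev_num, family = gaussian(link = "log"), data = dd)

## Hand the GLM coefficients to nls as starting values
m3 <- nls(xaq ~ a * exp(b * seed_prev_num),
          data = dd,
          start = list(a = exp(coef(m2)[1]), b = coef(m2)[2]))

## Same mean model and error model, so the estimates agree
all.equal(coef(m2)[["(Intercept)"]], log(coef(m3)[["a"]]), tolerance = 1e-5)
all.equal(coef(m2)[["seed_prev_num"]], coef(m3)[["b"]], tolerance = 1e-5)
```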
I suspect that the primary problem is the scaling of your x variable ...
m4 <- nls(xaq ~ a*exp(b*seed_prev_num/1e7),
data = dd,
start = list(a = 1, b = 1))
all.equal(coef(m4)[["b"]]/1e7, coef(m3)[["b"]],
tolerance = 1e-4)
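Spelled out, the rescaling fix looks like this (same dd as above, rebuilt here so the chunk runs on its own). Dividing x by 1e7 puts it on roughly unit scale, so nls converges even from the generic start values a = 1, b = 1:

```r
## Rebuild the question's data (named dd, as above)
dd <- data.frame(
  seed_prev_num = c(2671089.52136, 1723563.978132, 1519478.513963,
    10359757.487541, 435039.876186, 400860.993997, 4824556.66506,
    7345496.18374, 7167489.404247, 1541803.448572, 802314.685373,
    2589889.980437, 1578945.817656, 593092.510586, 9524646.88053,
    1517305.29024, 12332649.496439, 1225895.745565, 1723563.978132,
    4029354.348235, 486209.416573, 2473918.859946, 2200685.368227,
    1687211.87222, 34608587.788775, 127034.804899, 2174606.683551,
    694443.762395, 1297414.562631, 269677.307445),
  xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 6.017711172,
    3.124199113, 3.369492219, 2.702313072, 2.587612668, 3.000256279,
    2.360256095, 4.987947212, 7.002252252, 16.762824783, 2.35521676,
    10.671484375, 3.007449177, 7.73650282, 4.812929849, 2.976955136,
    17.230394149, 4.272320716, 5.298590538, 6.414285714, 2.321626944,
    9.362363919, 3.303806668, 10.004836415, 8.26450434, 5.726739927))

## Exponential fit on rescaled x: converges from trivial start values
m4 <- nls(xaq ~ a * exp(b * seed_prev_num / 1e7),
          data = dd,
          start = list(a = 1, b = 1))

## Undo the rescaling to recover the slope on the original x scale
coef(m4)[["b"]] / 1e7
```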
The fits don't get much better: your management may have to live with the fact that these are noisy data.
I also tried mgcv::gam with a log-link Gaussian GLM and a smooth function of seed_prev_num, but it collapses back to an exponential fit. In any case, that model would violate your "easy to export coefficients and compute in Excel" rubric.
m6 <- nls(xaq ~ a*exp(b*(seed_prev_num/1e7)^c),
data = dd,
start = list(a = 9, b = -2, c = 1.1)
)
does slightly better (it increases the squared correlation between predictions and observed values, i.e. one form of R^2, to 0.34) but is not a better model according to AIC ...
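For completeness, here is a runnable sketch of that comparison. It rebuilds dd once more, and uses the rescaled exponential (the m4 form, which is the same model as m3 up to the 1e7 scaling of b) as the two-parameter baseline:

```r
## Rebuild the question's data (named dd, as above)
dd <- data.frame(
  seed_prev_num = c(2671089.52136, 1723563.978132, 1519478.513963,
    10359757.487541, 435039.876186, 400860.993997, 4824556.66506,
    7345496.18374, 7167489.404247, 1541803.448572, 802314.685373,
    2589889.980437, 1578945.817656, 593092.510586, 9524646.88053,
    1517305.29024, 12332649.496439, 1225895.745565, 1723563.978132,
    4029354.348235, 486209.416573, 2473918.859946, 2200685.368227,
    1687211.87222, 34608587.788775, 127034.804899, 2174606.683551,
    694443.762395, 1297414.562631, 269677.307445),
  xaq = c(3.140014793, 9.363136176, 10.842283188, 2.495966588, 6.017711172,
    3.124199113, 3.369492219, 2.702313072, 2.587612668, 3.000256279,
    2.360256095, 4.987947212, 7.002252252, 16.762824783, 2.35521676,
    10.671484375, 3.007449177, 7.73650282, 4.812929849, 2.976955136,
    17.230394149, 4.272320716, 5.298590538, 6.414285714, 2.321626944,
    9.362363919, 3.303806668, 10.004836415, 8.26450434, 5.726739927))

## Two-parameter exponential baseline (rescaled x)
m4 <- nls(xaq ~ a * exp(b * seed_prev_num / 1e7),
          data = dd, start = list(a = 1, b = 1))

## Three-parameter power-exponent variant
m6 <- nls(xaq ~ a * exp(b * (seed_prev_num / 1e7)^c),
          data = dd, start = list(a = 9, b = -2, c = 1.1))

## Squared correlation between fitted and observed: one form of R^2
r2 <- function(m) cor(fitted(m), dd$xaq)^2
c(exp = r2(m4), pow = r2(m6))

## AIC comparison: lower is better; the extra parameter c is penalized
AIC(m4, m6)
```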