Search code examples
rregressionnormal-distributionleast-squaresskew

Nonlinear least squares regression of skewed normal distribution in R (or any language)


First time poster. Apologize in advance if I use improper etiquette or vocabulary.

I've time series data of chemical concentration (y) vs time (x) from a USGS river survey. It exhibits a skew normal distribution that I would like to model via non-linear least squares regression. I'm able to fit a normal distribution curve to the data, but can't seem to incorporate "skewness" into the model.

I arrived at the my normal distribution fit from the answer given by Whuber here... Linear regression best polynomial (or better approach to use)?

my data and code...

y <- c(0.532431978850729, 0.609737363640599, 0.651964078008195, 0.657368066358271, 
0.741496240155044, 0.565435828629966, 0.703655525439792, 0.718855614453251, 
0.838983191559565, 0.743767469276213, 0.860155614137561, 0.81923941209205, 
1.07899884812998, 0.950877380129941, 1.01284743983765, 1.11717867112622, 
1.08452873942528, 1.14640319037414, 1.35601176845714, 1.55587090166098, 
1.81936731953165, 1.79952819117948, 2.27965075864338, 2.92158756334143, 
3.28092981974249, 1.09884083379528, 4.52126319475028, 5.50589160306292, 
6.48951979830975, 7.61196542128105, 9.56700470248019, 11.0814901164772, 
13.3072954022821, 13.8519364143597, 11.4108376964234, 8.72143939873907, 
5.12221325838613, 2.58106436004881, 1.0642701141608, 0.44945378376047, 
0.474569233285229, 0.128299654944011, 0.432876244482592, 0.445456125461339, 
0.435530646939433, 0.337503495863836, 0.456525976632425, 0.35851011819921, 
0.525854215793115, 0.381206935673774, 0.548351975353343, 0.365384673834335, 
0.418990479166088, 0.50039125911365, 0.490696977485334, 0.376809405620949, 
0.484559448760701, 0.569111550743562, 0.439671715276438, 0.353621820313257, 
0.444241243031233, 0.415197754444015, 0.474852839357701, 0.462144150397257, 
0.535339727332139, 0.480714031175711)

#creating an arbitrary vector to represent time
x <- seq(1,length(y), by=1)

#model of normal distribution 
f <- function(x, theta)  { 
  m <- theta[1]; s <- theta[2]; a <- theta[3]; b <- theta[4];
  a*exp(-0.5*((x-m)/s)^2) + b
}

# Estimate some starting values.
m.0 <- x[which.max(y)]; s.0 <- (max(x)-min(x))/4; b.0 <- min(y); a.0 <- (max(y)-min(y))

# Do the fit.  (It takes no time at all.)
fit <- nls(y ~ f(x,c(m,s,a,b)), data.frame(x,y), start=list(m=m.0, s=s.0, a=a.0, b=b.0))

# Display the estimated location of the peak and its SE.
summary(fit)$parameters["m", 1:2]

par(mfrow=c(1,1))
plot(c(x,0),c(y,f(coef(fit)["m"],coef(fit))), main="Data", type="n",
     xlab="Time", ylab="Concentration")
curve(f(x, coef(fit)), add=TRUE, col="Red", lwd=2)
points(x,y, pch=19)

So, any suggestions on how to adjust the model to accommodate skewness?

Cheers, Jamie


Solution

  • Can you use generalized additive model (GAM)? GAM is powerful and flexible, but it is difficult to interpret the model coefficients. So the decision would be depends on your purpose. If the purpose is to evaluate trend, or the purpose is to predict concentration (within the known time range), then GAM could be a good choice.

    library(mgcv)
    library(ggplot2)
    
    dat <- data.frame(x = 1:length(y), y = y)
    
    fit_gam <- gam(y ~ s(x, k = 20), data = dat) 
    
    ggplot(dat, aes(x = x, y = y)) +
      geom_point() +
      geom_line(data = data.frame(x = x, y = fit_gam$fitted.values),
                color = "red") +
      ggtitle("Data") +
      xlab("Cocentration") +
      ylab("Time") +
      theme_bw() +
      theme(panel.grid = element_blank())
    

    enter image description here

    The following is another option to apply stat_smooth to fit the same GAM model.

    ggplot(dat, aes(x = x, y = y)) +
      geom_point() +
      stat_smooth(method = "gam", formula = y ~ s(x, bs = "tp", k = 20)) +
      ggtitle("Data") +
      xlab("Cocentration") +
      ylab("Time") +
      theme_bw() +
      theme(panel.grid = element_blank())
    

    enter image description here