Analyzing non-linear data with R

I have following data in which there seems to be a curvilinear relation between xx and yy:

head(ddf)
  xx yy
1  1 10
2  2  9
3  3 11
4  4  9
5  5  7
6  6  6

ddf = structure(list(xx = 1:23, yy = c(10L, 9L, 11L, 9L, 7L, 6L, 9L, 
8L, 5L, 4L, 6L, 6L, 5L, 4L, 6L, 8L, 4L, 6L, 8L, 11L, 8L, 10L, 
9L)), .Names = c("xx", "yy"), class = "data.frame", row.names = c(NA, 
-23L))

with(ddf, plot(xx,yy))

enter image description here

I want to analyze this and get following:

Find the nonlinear relation between xx and yy
Get its equation
Get its P value
If possible get R (correlation coefficient) (nonlinear)
Plot this curve

I know of nls, which gives me an equation but I have to enter a formula, which may not be correct. Also I cannot get the curve and the R and P values here.

> nls(yy~a*(xx^b), data=ddf)
Nonlinear regression model
  model: yy ~ a * (xx^b)
   data: ddf
      a       b 
 9.5337 -0.1184 
 residual sum-of-squares: 95.85

Number of iterations to convergence: 8 
Achieved convergence tolerance: 3.407e-06
Warning message:
In nls(yy ~ a * (xx^b), data = ddf) :
  No starting values specified for some parameters.
Initializing ‘a’, ‘b’ to '1.'.
Consider specifying 'start' or using a selfStart model

I also know of stat_smooth of ggplot which can plot a curve. But that also does not give me the formula, R and P values.

Solution

You can predict the values along a new range of xx values and plot them. Regarding the results that you wanted:

# 1. Find the nonlinear relation between xx and yy
fit <- nls(yy ~ a*xx^b, data=ddf)
# 2. Get its equation
coef(fit)
# 3. Get its P value
summary(fit)
# 4. If possible get R (correlation coefficient) (nonlinear)
cor(predict(fit), ddf$yy)
# 5. Plot this curve
newdat <- data.frame(xx=seq(min(ddf$xx), max(ddf$xx),,100))
newdat$yy <- predict(fit, newdat)
plot(yy ~ xx, ddf)
lines(yy ~ xx, newdat, col=2)

enter image description here

Here is another option using a polynomial:

# 1. Find the nonlinear relation between xx and yy
fit <- lm(yy ~ poly(xx, n=2, raw=TRUE), data=ddf)
# 2. Get its equation
coef(fit)
# 3. Get its P value
summary(fit)
# 4. If possible get R (correlation coefficient) (nonlinear)
cor(predict(fit), ddf$yy)
# 5. Plot this curve
newdat <- data.frame(xx=seq(min(ddf$xx), max(ddf$xx),,100))
newdat$yy <- predict(fit, newdat)
plot(yy ~ xx, ddf)
lines(yy ~ xx, newdat, col=2)

enter image description here

And, finally, the GAM version:

# 1. Find the nonlinear relation between xx and yy
library(mgcv)
fit <- gam(yy ~ s(xx), data=ddf)
# 2. Get its equation
coef(fit)
# 3. Get its P value
summary(fit)
# 4. If possible get R (correlation coefficient) (nonlinear)
cor(predict(fit), ddf$yy)
# 5. Plot this curve
newdat <- data.frame(xx=seq(min(ddf$xx), max(ddf$xx),,100))
newdat$yy <- predict(fit, newdat)
plot(yy ~ xx, ddf)
lines(yy ~ xx, newdat, col=2)

enter image description here

You can see from the coefficients of the GAM model that this is a much larger model and more difficult to represent in a formula. But, you have a lot of flexibility in it's form, and it should reduce (i.e. via a smaller number of "knots") to a linear model if this is the best relationship.