Search code examples
rplotcurveloess

Scatterplot with best fit curve selected by AIC


I am trying to write a function that does the following: 1. Produces and x,y scatterplot 2. Add a loess curve 3. Add a curve based on an AIC model fitting procedure where the best fit model could be linear, quadratic, or cubic. I want only a single line drawn for this step (i.e., the best fit, not all 3 possibilities).

I can do steps 1 and 2, but can't make #3 work. Where am I going wrong? Sample data below, but I will be running this function on various data sets, some of which will vary in length and NAs.

maindata = structure(c(1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 
1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 
1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 
1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 
2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, -1.54, 
1.5, 3.26, -2.79, 0.54, -0.51, -2.12, 1.83, -1.88, 0.47, -1.05, 
-2.16, -1.04, -1.77, -2.54, 1.67, -2.97, -2.58, NA, -0.08, 
2.05, 0.27, 2.18, 0.01, NA, -2.08, -0.42, -0.23, -1.58, -0.55, 
2.63, 0.38, 2.17, -3.09, 3.14, -3.01, -0.13, 2.38, 3.88, 1.14, 
2.54, 1.71, 2.86, -1.11, -1.98, -0.93, 1.03, 2.25, 1.18, -1.91, 
1, -0.09, 0.7, -1.35, -0.2, 1.35, 1.72, 0.72, -5.96, 2.95, -0.25, 
NA, 47, 40, NA, 20, 70, 80, 30, 33, 40, 71, 63, 25, 66, 41, 25, 
38, 18, 22, 60, 85, 30, 75, 25, 80, 65, 33, 85, 95, 45, 75, 19, 
75, 27, 13, 14, 15, 99, 22, 10, NA, 20, 35, 17, 55, 35, 70, 47, 
24, 45, 38, 50, 90, 60, 50, 100, 42, 34, 55, 10, 15, 90, NA), .Dim = c(62L, 
3L), .Dimnames = list(NULL, c("year", "x_values", "y_values")))


dd_plot = function(x, y, yaxis, xaxis) {
  subset.data = subset(maindata, x!="NA" & y!="NA"); 
  plot(subset.data$y~subset.data$x, pch = 20, ylab = yaxis, xlab=xaxis)
    lines(loess.smooth(subset.data$x, subset.data$y), col = "blue", lwd = 2, lty =2)
  fit = stepAIC(lm(subset.data$y~subset.data$x+I(subset.data$x^2)+I(subset.data$x^3)))
  lines(subset.data$x, predict(fit), col="red")
    legend("topleft", c("Lowess","Best Fit Polynomial"), lty = c(2,1), col=c("blue","red"), bty="n", xjust = -0.2)
}

dd_plot(y = y_values, x = x_values, yaxis = "Y_label", xaxis = "X_label")

Solution

  • Well, you don't exactly describe which part "doesn't work" or in which way it doesn't match you expectations. You sample has a bunch of mismatching variable names which perhaps your real data doesn't, bit it's looking for x and y in maindata which actually has x_values and y_values. Then your call to dd_plot references variables x_values and y_values which don't exist.

    But i'll assume you got past that are talking about the ball of string that's plotted rather than the simple curve for the best fit polynomial. The problem is that your data should be sorted by x-value when drawing lines. R just connects consecutive points. So i think you want

    lines(subset.data$x[order(subset.data$x)], 
        predict(fit)[order(subset.data$x)], col="red")
    

    instead of plotting the unsorted data. That returns

    enter image description here