I am creating very basic plots to visualize trends in a dataset. I'm using the gam smoother with geom_smooth, but this is over-fitting the data at several sites. For example, in the very first facet, the smoother over-fits the red data points.
Is there a way to adjust my code to set a max # of inflection points to focus on the overarching trends in the dataset (e.g., max. 2 or 3 inflection points).
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs"),
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))
EDIT (2022-12-19): following some of the comments raised by @GavinSimpson, I've tried to make some visuals of how changing k changes the shape of geom_smooth curve.
For these plots, I've tried to create an example dataset that has a rough quadratic section followed by a roughly linear section. Ideally, I want to find a k value that would capture the inflection point at the 'peak' of the quadratic section and the transition from quadratic to linear (x ~= 5 & 10).
library(ggpubr)
library(ggplot)
df <- data.frame(
x1 = c(1:19),
y1 = c(1.1, 4, 6.9, 8.5, 10, 7.2, 3.7, 2.2, 1, #quadratic
0.4, 1.6, 2.5, 3.7, 4.6, 5.8, 6.7, 7.9, 8.8, 10)) #linear
pk <- ggplot(data = df,
aes(y = y1, x = x1)) +
geom_point(col = "red") +
theme_bw()
# vary k from 2 to not specified
ggarrange(nrow = 3, ncol = 2,
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 2)) + ggtitle("k=2"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 3)) + ggtitle("k=3"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 4)) + ggtitle("k=4"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 5)) + ggtitle("k=5"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 6)) + ggtitle("k=6"),
pk + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) + ggtitle("k unconstrained"))
The first plot throws an error message: Warning message: In smooth.construct.cr.smooth.spec(object, data, knots) : basis dimension, k, increased to minimum possible Indicating that R is increasing the k from 2 to 3.
Only when k = 5 do we start to see the smoother begin to capture the two main inflection points in the data (x ~= 5 & 10). But what does k = 5 mean? It also seems that it's added another slight inflection point around x ~= 16.
At k = 6, the SE is reduced and the line comes even closer to each data point, and it looks nearly identical to the figure when k is not specified at all, which might be ok in this situation, but for my dataset, when k isn't specified there is a lot of over-fitting when I really just want to see the most important inflection points.
So still no clear answer on how to specify inflection points.
Thanks to the suggestion from @tpetzoldt, I was able to identify how to set the max. number of inflection points in the gam formula.
ggplot() +
stat_summary(fun = "mean",
data = df,
aes(x = year, y = morpho, group = interaction(year, group), col = group),
shape = 1, alpha = 0.4) +
geom_smooth(method = "gam", se = TRUE,
formula = y ~ s(x, bs = "cs", k = 2), #k = number of inflection points
data = df,
aes(x = year, y = morpho, group = group, col = group)) +
facet_grid(. ~ site) +
theme_bw() +
theme(legend.position = "bottom",
legend.direction = "horizontal",
axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))