Search code examples
rdata-modelinggammgcv

Error running binomial GAM in mgcv with proportional data


I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning

In eval(family$initialize) : non-integer #successes in a binomial glm!

Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.

m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region, 
          family=binomial, data=DAT_WC, method = "REML")

prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below

structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold", 
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L, 
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L, 
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L, 
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013", 
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L, 
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094, 
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222, 
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444, 
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444, 
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23, 
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184, 
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")

From the literature (Zuur, 2009) it seems that a binomial distribution is the best used for proportional data. But it doesn't seem to be working. It's running but giving the above warning, and outputs that don't make sense. What am I doing wrong here?


Solution

  • This is a warning, not an error, but it does indicate that something is somewhat not correct; the binomial distribution has support on the non-negative integer values so it doesn't make sense to pass in non-integer values without the samples totals from which the proportions were formed.

    You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.

    Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warming will go away, but then you'll not be able to use AIC and related tools that expect a real likelihood.

    If you proportions are true proportions then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on reals values on the unit interval (0, 1) (but technically not 0 or 1 — mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).