
Root mean square deviation on binned GAM results using R


Background

A PostgreSQL database uses PL/R to call R functions. An R call to calculate Spearman's correlation looks as follows:

cor( rank(x), rank(y) )

Also in R, a naïve fit of a generalized additive model (GAM) looks like this:

data.frame( x, fitted( gam( y ~ s(x) ) ) )

Here x represents the years from 1900 to 2009 and y is the average measurement (e.g., minimum temperature) for that year.

Problem

The fitted trend line (using GAM) is reasonably accurate, as you can see in the following picture:

The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.

Possible Solution

One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.

Questions

Q.1. How would you implement the RMSE calculation on the binned data to get a correlation (between 0 and 1) of GAM's fit to the measurements, in the R language?

Q.2. Is there a better way to find the accuracy of GAM's fit to the data, and if so, what is it (e.g., root mean square deviation)?

Attempted Solution 1

  1. Call the PL/R function using the observed amounts and the model (GAM) amounts:
    correlation_rmse := climate.plr_corr_rmse( v_amount, v_model );
  2. Define plr_corr_rmse as follows (where o and m represent the observed and modelled data):
    CREATE OR REPLACE FUNCTION climate.plr_corr_rmse(
    o double precision[], m double precision[])
    RETURNS double precision AS
    $BODY$
    sqrt( mean( (o - m) ^ 2 ) )
    $BODY$
    LANGUAGE 'plr' VOLATILE STRICT
    COST 100;
    

The o - m is wrong. I'd like to bin both data sets by calculating the mean of every 5 data points (there will be at most 110 data points). For example:

omean <- c( mean(o[1:5]), mean(o[6:10]), ... )
mmean <- c( mean(m[1:5]), mean(m[6:10]), ... )

Then correct the RMSE calculation as:

sqrt( mean( (omean - mmean) ^ 2 ) )

How do you calculate c( mean(o[1:5]), mean(o[6:10]), ... ) for a vector of arbitrary length, using an appropriate number of bins (bins of 5 points, for example, might not be ideal for only 67 measurements)?

I don't think hist is suitable here, is it?

Attempted Solution 2

The following code solves the problem; however, it drops data points from the end of the vector (to make its length divisible by 5). The solution isn't ideal, as the number "5" is rather magical.

while( length(o) %% 5 != 0 ) {
  o <- o[-length(o)]
}

omean <- apply( matrix(o, 5), 2, mean )

What other options are available?

Thanks in advance.


Solution

  • You say that:

    The problem is that the correlations (shown in the bottom left) do not accurately reflect how closely the model fits the data.

    You could calculate the correlation between the fitted values and the measured values:

    cor(y,fitted(gam(y ~ s(x))))
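
A minimal sketch of that correlation on made-up data follows. It assumes gam() and s() come from the mgcv package (the question does not say which package is used), and the simulated series merely stands in for the yearly measurements. The RMSE of the residuals is included for comparison.

```r
library(mgcv)  # assumption: gam() and s() are the mgcv versions

set.seed(1)
x <- 1900:2009                       # years, as in the question
# Simulated "measurements": slow trend + smooth wiggle + noise (illustrative only)
y <- 10 + 0.01 * (x - 1900) + sin((x - 1900) / 15) +
  rnorm(length(x), sd = 0.3)

fit <- gam(y ~ s(x))                 # same naive model as in the question
yhat <- as.numeric(fitted(fit))      # fitted trend, one value per year

r <- cor(y, yhat)                    # correlation between measured and fitted values
rmse <- sqrt(mean((y - yhat)^2))     # root mean square error of the fit

print(c(correlation = r, rmse = rmse))
```

Here r measures how closely the fitted trend tracks the measurements, while rmse reports the typical size of the misfit in the units of y.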
    

    I don't see why you want to bin your data, but you could do it as follows:

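    mean.binned <- function(y, n = 5) {
      # Pad y with NAs so that its length becomes a multiple of n
      pad <- (n - (length(y) %% n)) %% n
      # One bin per column; average each column, ignoring the padding
      apply(matrix(c(y, rep(NA, pad)), n), 2,
            function(x) mean(x, na.rm = TRUE))
    }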
    

    It looks a bit ugly, but it should handle vectors whose length is not a multiple of the binning length (i.e. 5 in your example).
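
As a rough illustration, the binning function can be combined with an RMSE calculation as sketched below. The rmse helper and the sample vectors o and m are hypothetical, made up for this example; mean.binned is repeated so the block runs on its own.

```r
# mean.binned as defined above: pad with NA to a multiple of n,
# then average each column of the resulting n-row matrix.
mean.binned <- function(y, n = 5) {
  pad <- (n - (length(y) %% n)) %% n
  apply(matrix(c(y, rep(NA, pad)), n), 2,
        function(x) mean(x, na.rm = TRUE))
}

# Hypothetical helper: square first, then average, then take the root.
# Note the parentheses: sqrt(mean(a - b)^2) would instead give abs(mean(a - b)).
rmse <- function(a, b) sqrt(mean((a - b)^2))

# Made-up observed (o) and modelled (m) values; 12 is not a multiple of 5.
o <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
m <- o + c(1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1)

omean <- mean.binned(o, 5)   # bins cover points 1-5, 6-10 and 11-12
mmean <- mean.binned(m, 5)

rmse_raw    <- rmse(o, m)            # RMSE on the individual points
rmse_binned <- rmse(omean, mmean)    # RMSE on the bin means

print(c(raw = rmse_raw, binned = rmse_binned))
```

Because averaging within a bin cancels errors of opposite sign, the binned RMSE in this example is smaller than the point-wise RMSE; binning therefore hides some of the misfit rather than measuring it.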

    You also say that:

    One way to improve the accuracy of the correlation is to use a root mean square error (RMSE) calculation on binned data.

    I don't understand what you mean by this. The correlation is one factor in determining the mean squared error; for example, see equation 10 of Murphy (1988, Monthly Weather Review, v. 116, pp. 2417-2424). But please explain what you mean.
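
The link between correlation and mean squared error can be checked numerically. The sketch below uses the standard decomposition MSE = (mean(m) - mean(o))^2 + s_m^2 + s_o^2 - 2 * s_m * s_o * r, with population (divide-by-n) standard deviations. This is assumed to be the kind of relationship the cited equation expresses; the paper's own notation is not reproduced here, and the data are made up.

```r
set.seed(42)
# Made-up observed (o) and modelled (m) values, for illustration only
o <- rnorm(50, mean = 10, sd = 2)
m <- 0.8 * o + rnorm(50, mean = 1.5, sd = 1)

# Population standard deviation (divides by n, unlike sd(), which uses n - 1)
sd_pop <- function(v) sqrt(mean((v - mean(v))^2))

mse <- mean((m - o)^2)   # mean squared error, computed directly
r   <- cor(m, o)         # Pearson correlation

# Same quantity rebuilt from bias, spreads and correlation
decomposed <- (mean(m) - mean(o))^2 +
  sd_pop(m)^2 + sd_pop(o)^2 -
  2 * sd_pop(m) * sd_pop(o) * r

print(c(mse = mse, decomposed = decomposed))
```

The two numbers agree, which shows that the correlation is only one ingredient of the MSE: the bias term and the two spreads matter as well.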