Search code examples
rstatisticsstan

Need to fix Stan code for the generalized pareto distribution to take in real arguments rather than vectors


I am using the functions defined here: Extreme value analysis and user defined probability functions in Stan for modeling the data with a generalized pareto distribution, but my problem is that my model is in a for-loop and expects three real valued arguments, whereas, the gpd functions assume a vector, real, real argument.

I’m not so sure that my model chunk is so amenable to being vectorized, and so I was thinking I would need to have the gpd functions take in real valued arguments (but maybe I’m wrong).

I’d appreciate any help with switching the code around to achieve this. Here is my stan code

functions {
  real gpareto_lpdf(vector y, real k, real sigma) {
    // generalised Pareto log pdf 
    int N = rows(y);
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
    else
      return -sum(y)/sigma -N*log(sigma); // limit k->0
  }

  real gpareto_lcdf(vector y, real k, real sigma) {
    // generalised Pareto log cdf
    real inv_k = inv(k);
    if (k<0 && max(y)/sigma > -inv_k)
      reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma)
    if (sigma<=0)
      reject("sigma<=0; found sigma =", sigma)
    if (fabs(k) > 1e-15)
      return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
    else
      return sum(log1m_exp(-(y)/sigma)); // limit k->0
  }
}

data {
  // the input data
  int<lower = 1> n;                       // number of observations
  real<lower = 0> value[n];              // value measurements
  int<lower = 0, upper = 1> censored[n];  // vector of 0s and 1s

  // parameters for the prior
  real<lower = 0> a;
  real<lower = 0> b;
}

parameters {
  real k;
  real sigma;
}

model {
 // prior
 k ~ gamma(a, b);
 sigma ~ gamma(a,b);

 // likelihood
  for (i in 1:n) {
    if (censored[i]) {
      target += gpareto_lcdf(value[i] | k, sigma);  
    } else {
      target += gpareto_lpdf(value[i] | k, sigma);
    }
  }
}

Solution

  • You can get the model to accept a vector and avoid a for loop. In part you need to change the signature when you declare value but then also feed in as data the indices of the censored and uncensored observations.

    I'm posting below code that runs on made up data, but the data I made up is just some random log-normal variates and not actual draws from a pareto so I have no idea if the model is recovering the parameters or what its coverage looks like. You'll probably want to do a bit of simulation based calibration to check the model.

    The stan code:

    functions {
      real gpareto_lpdf(vector y, real k, real sigma) {
        // generalised Pareto log pdf 
        int N = rows(y);
        real inv_k = inv(k);
        if (k<0 && max(y)/sigma > -inv_k)
          reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
        if (sigma<=0)
          reject("sigma<=0; found sigma =", sigma);
        if (fabs(k) > 1e-15)
          return -(1+inv_k)*sum(log1p((y) * (k/sigma))) -N*log(sigma);
        else
          return -sum(y)/sigma -N*log(sigma); // limit k->0
      }
    
      real gpareto_lcdf(vector y, real k, real sigma) {
        // generalised Pareto log cdf
        real inv_k = inv(k);
        if (k<0 && max(y)/sigma > -inv_k)
          reject("k<0 and max(y)/sigma > -1/k; found k, sigma =", k, sigma);
        if (sigma<=0)
          reject("sigma<=0; found sigma =", sigma);
        if (fabs(k) > 1e-15)
          return sum(log1m_exp((-inv_k)*(log1p((y) * (k/sigma)))));
        else
          return sum(log1m_exp(-(y)/sigma)); // limit k->0
      }
    }
    
    data {
      // the input data
      int<lower = 1> n;                       // number of observations
      int<lower = 0> n_cens;
      vector<lower = 0>[n] value;              // value measurements
      int<lower=1> cens_id[n_cens]; // the indices of censored observations
      int<lower=1> nocens_id[n - n_cens]; // the indices of uncensored obs
    
      // parameters for the prior
      real<lower = 0> a;
      real<lower = 0> b;
    }
    
    parameters {
      real<lower=0> k;
      real<lower=0> sigma;
    }
    
    model {
     // prior
     k ~ gamma(a, b);
     sigma ~ gamma(a,b);
    
     // likelihood
     target += gpareto_lcdf(value[cens_id] | k, sigma);  
     target += gpareto_lpdf(value[nocens_id] | k, sigma);
    }
    

    This runs basically instantaneously on my mid-tier laptop using cmdstanr. You can run it with the R code

    library(cmdstanr)
    library(bayesplot)
    
    gparet_mod <- cmdstan_model("gpareto_example.stan")
    
    # make some fake data. This won't be correct and its
    # no proper test of the capacity of the model to 
    # recover the parameters, just seeing if it runs
    
    N <- 100
    value <- rlnorm(N)
    censored <- rbinom(N, 1, .5)
    N_cens <- sum(censored)
    cens_id <- which(censored == 1)
    nocens_id <- which(censored == 0)
    
    a <- b <- 2
    
    dat <- list(n = N, n_cens = N_cens, value = value,
                cens_id = cens_id, nocens_id = nocens_id,
                a = a, b = b)
    
    ests <- gparet_mod$sample(data = dat, parallel_chains = 2, chains = 2)
    

    This produces what appear to be sane samples

    ests$summary()
    # A tibble: 3 × 10
      variable     mean   median    sd   mad       q5      q95  rhat ess_bulk ess_tail
      <chr>       <dbl>    <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>    <dbl>    <dbl>
    1 lp__     -103.    -103.    1.10  0.767 -105.    -102.     1.00     750.     807.
    2 k           0.316    0.296 0.135 0.131    0.128    0.557  1.00     665.     570.
    3 sigma       0.691    0.685 0.113 0.110    0.511    0.887  1.00     772.     887.
    

    And posterior distributions

    mcmc_hist(ests$draws(variables = c("k", "sigma")))
    

    posterior_distribution_k_and_sigma

    I think the approach of Martin Arnold is a bit more elegant, but when I tried to get it to run on the same data the sampler broke down with divergences, so maybe I ported the model wrong or something needs tweaking.

    Edit

    Given the likelihood functions it makes more sense to constrain the parameters k and sigma to be positive. The model runs better and you'll get fewer error messages. I modified the stan code above to reflect this.