Search code examples
rdataframestanrstan

How to represent a categorical predictor rstan?


What is the proper way to format a categorical predictor to use in STAN? I cannot seem to input a categorical predictor as a normal factor variable, so what is the quickest way to transform a normal categorical variable such that Stan can accept it?

For example, say I had a a continue predictor and a categorical predictor

your_dataset = data.frame(income = c(62085.59, 60806.33, 60527.27, 67112.64, 57675.92, 58128.44, 60822.47, 55805.80, 63982.99, 64555.45),
country = c("England", "England", "England", "USA", "USA", "USA", "South Africa", "South Africa", "South Africa", "Belgium"))

Which looks like this:

     income      country
1  62085.59      England
2  60806.33      England
3  60527.27      England
4  67112.64          USA
5  57675.92          USA
6  58128.44          USA
7  60822.47 South Africa
8  55805.80 South Africa
9  63982.99 South Africa
10 64555.45      Belgium

How would I prepare this to be entered in rstan?


Solution

  • It is correct that Stan only inputs real or integeger variables. In this case, you want to convert a categorical predictor into dummy variables (perhaps excluding a reference category). In R, you can do something like

    dummy_variables <- model.matrix(~ country, data = your_dataset)
    

    Which will look like this

       (Intercept) countryEngland countrySouth Africa countryUSA
    1            1              1                   0          0
    2            1              1                   0          0
    3            1              1                   0          0
    4            1              0                   0          1
    5            1              0                   0          1
    6            1              0                   0          1
    7            1              0                   1          0
    8            1              0                   1          0
    9            1              0                   1          0
    10           1              0                   0          0
    attr(,"assign")
    [1] 0 1 1 1
    attr(,"contrasts")
    attr(,"contrasts")$country
    [1] "contr.treatment"
    

    However, that might not come out to the right number of observations if you have unmodeled missingness on some other variables. This approach can be taken a step farther by inputting the entire model formula like

    X <- model.matrix(outcome ~ predictor1 + predictor2 ..., data = your_dataset)
    

    Now, you have an entire design matrix of predictors that you can use in a .stan program with linear algebra, such as

    data {
      int<lower=1> N;
      int<lower=1> K;
      matrix[N,K]  X;
      vector[N]    y;
    }
    parameters {
      vector[K] beta;
      real<lower=0> sigma;
    }
    model {
      y ~ normal(X * beta, sigma); // likelihood
      // priors
    }
    

    Utilizing a design matrix is recommended because it makes your .stan program reusable with different variations of the same model or even different datasets.