Search code examples
rpermutationhypothesis-testpermute

Setting up a permutation design (using permute package)


The main question I have is how to create a permutation design for the following sampling design: Data was collected under a shrub and directly nearby, outside the shrub influence. We called this 'pair' (df$pair); Three different shrub species were chosen, and we expect that the response will vary for each shrub. For example, shrub 1 will increase the response variable while shrub 2 will not change and shrub 3 will decrease. (We are not interested whether the response changes among a specific condition among each shrub, such as a comparison between under shrub 1 vs outside shrub 2). I am not sure if I should include the shrub species (either shrub or condition_shrub) as a 'block' in the perm design or not.

The hypothesis is being tested through a multivariate glm using mvabund package.

Data for permutations is similar to this:

df <- data.frame(
  pair = rep(1:30, each = 2),
  condition = rep(c("under", "outside"), times = 10),
  shrub = rep(c("shrub1", "shurb2", "shrub3"), each = 20),
  condition_shrub = c(
    rep(c(rep("shrub1", 1), "outside"), times = 10),
    rep(c(rep("shrub2", 1), "outside"), times = 10),
    rep(c(rep("shrub3", 1), "outside"), times = 10)
  )
)

So far I came up with this solution, but I am not sure this is correct to test our hypothesis.

I started building a model like this:

model <- manyglm(response ~ shrub*pair,
  family = negative.binomial, data = df)

However, I realized that in this way I would up testing whether conditions within shrubs change among each other (such as condition under shrub 1 vc condition outside shrub 3).

So I come up with another model that seems to make more sense:

model <- manyglm(response ~ condition_shrub,
  family = negative.binomial, data = df).

In this way, I treated the outside condition as a factor level (along with all the three shrub species).

perm.ctrl <- how(
  within=Within(type="free"),
  plots  = Plots(strata = df$pair, type = "free"),
  blocks = df$shrub,
  nperm = 9999)

I understand that with this design the permutations are occurring always within each shrub type (and outside) and the paired structure is being preserved. Can someone guide me through this?

However, the way I see, I can ignore the paired sampling design and allow free permutations under this circumstance: 'condition_shrub', to test the hypothesis whether the response vary among each species and outside; that is, the response outside the shrubs will always be equal, but will change if under one of the shrubs.

Thanks in advance.


Solution

  • The thing you want to test is shrub so you need to permute the pair variable randomly. This will generate the null hypothesis of no effect of shrub because under this hypothesis it shouldn't matter if we mix pairs of observations from shrub1 with those of shrub2 or shrub3, and the value of shrub only varies at the pair level - it doesn't vary within a pair. So, to make this a valid test, we need to keep fixed (unpermuted) the samples for each pair (the under vs outside samples).

    So I think you need:

    model <- manyglm(response ~ shrub, family = negative.binomial, data = df)
    
    h <- with(
      df,
      how(
        within = Within(type = "none"),
        plots = Plots(strata = pair, type = "free"),
        nperm = 999
      )
    

    Because we are keeping the observations within each pair fixed, we don't include the variation due to under vs outside in the permutational distribution of the test statistics under H0.

    Conversely, if you were interested in the comparison of under verse outside, you would only want to permute the samples within a pair, not between pairs. You would achieve this using:

    model <- manyglm(response ~ condition * shrub,
      family = negative.binomial, data = df)
    
    h <- with(
      df,
      how(
        within = Within(type = "free"),
        blocks = shrub, # not needed but won't hurt
        plots  = Plots(strata = pair, type = "none"),
        nperm  = 999
      )
    

    Here, we probably want to account for differences between shrubs and potentially different condition effects for each shrub, hence the model specification I used. Note also that I have blocked on shrub; in this case I don't think this is necessary because you have uniquely coded your pairs. If you had the pairs coded as pair1, pair2, ... pair10 only, such that we had pair1 for each of the shrubs, we would need to block on shrub to stop samples being swapped between shrubs, which isn't appropriate.