I have success/failure data (trees that survived/died over a certain period) and would like to estimate an error from a binomial distribution to be associated with each of my observations (7 sites). So far I have been using glm to do so:
s <- c(1, 20, 0, 40, 2, 1, 0) # success
f <- c(2, 0, 20, 4, 50, 0, 1) # failure
# for each observation I would calculate this error:
error <- vector()
z_scores <- vector()
p_value <- vector()
for (i in 1:7) {
  models <- glm(cbind(s[i], f[i]) ~ 1, family = 'binomial')
  error[i] <- summary(models)$coefficients[2]    # Std. Error of the intercept
  z_scores[i] <- summary(models)$coefficients[3] # z value
  p_value[i] <- summary(models)$coefficients[4]  # Pr(>|z|)
}
Would this be the best approach?
How is the probability of the binomial distribution estimated here?
Note that, regardless of the number of successes and failures, my error is extremely high when either s or f is 0.
Here is some code to recompute most of your results (except the extreme ones caused by zeros) without using glm, and I will explain the meaning behind them.
s <- c(1, 20, 0, 40, 2, 1, 0) # success
f <- c(2, 0, 20, 4, 50, 0, 1) # failure
#for each observation I would calculate this error:
error <- vector()
z_scores <- vector()
p_value <- vector()
for (i in 1:7) {
  models <- glm(cbind(s[i], f[i]) ~ 1, family = 'binomial')
  error[i] <- summary(models)$coefficients[2]
  z_scores[i] <- summary(models)$coefficients[3]
  p_value[i] <- summary(models)$coefficients[4]
}
logit <- function(x) {
  log(x / (1 - x))       # log-odds
}
dlogit <- function(x) {
  1 / (x * (1 - x))      # derivative of the logit function
}
p_hat <- s / (s + f)
## sqrt(p_hat * (1 - p_hat) / (s + f))
## is the standard error of p_hat
## error1 is the standard error of logit(p_hat)
error1 <- dlogit(p_hat) * sqrt(p_hat * (1 - p_hat) / (s + f))
## divide the estimate by its standard error to get the z-score
z_scores1 <- logit(p_hat) / error1
## two-sided p-value from the normal approximation
p_value1 <- 2 * pnorm(-abs(z_scores1))
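As a quick check with the vectors just defined, you can put the glm-based and hand-computed values side by side; apart from the rows where s or f is zero, they agree:
## side-by-side comparison of glm output and the hand-computed values;
## the rows with s = 0 or f = 0 are the extreme ones discussed below
round(cbind(s, f, error, error1, z_scores, z_scores1, p_value, p_value1), 4)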
The first thing you need to know is the rationale behind the standard error, z-score and p-value. In statistics, we first have a model (in this case the binomial model: s | (s + f) ~ Binomial(s + f, p)) and we want to use it to fit the data we have and
1) get estimates (p in this case);
2) since the data are generated randomly, quantify how good our estimates are. This is what the standard error, z-score and p-value do: they "measure the randomness in the estimation". And here is an important "trick": since we don't know the true mechanism that generates the data, we can only approximate the randomness in our estimates by assuming that
a) our model is (or is similar to) the true mechanism of data generation, and
b) the real parameter is close to our estimate. This often requires a large sample size; in this case the sample size is just s + f, so s + f must be large enough for the inference (standard error, z-score and p-value) to be valid. We can see that in cases i = 1, 6 and 7 the sample size is really small, which makes the corresponding standard errors, z-scores and p-values unreliable.
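One way to see the small-sample problem concretely is to compare the Wald (normal-approximation) p-value with an exact binomial test of p = 0.5, which is the hypothesis glm is testing, as explained below; binom.test is in base R. For case i = 1 (s = 1, f = 2):
## exact two-sided test of p = 0.5 for case i = 1 (s = 1, f = 2)
binom.test(x = 1, n = 3, p = 0.5)
## the Wald p-value computed above from the normal approximation
p_value1[1]
With only three trees the two p-values are quite different, a symptom of the normal approximation breaking down.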
Now I can talk about the technical details behind my calculations and what they mean. In glm, besides a Binomial(n, p) model, you also assume a model for p like this:
logit(p) ~ N(mu, sigma^2)
And the logit function is just like the one in my code.
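To make the connection concrete: the intercept that glm reports is just the estimated logit(p). A minimal check for case i = 4 (s = 40, f = 4), reusing the logit function defined above:
## the intercept of the intercept-only logistic model is logit(p_hat)
m4 <- glm(cbind(40, 4) ~ 1, family = 'binomial')
coef(m4)              # estimate on the logit scale
logit(40 / (40 + 4))  # the same value computed directly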
In this simple case, the estimate of the binomial probability p is just p_hat <- s / (s + f) (whether you use glm or not), and from the variance formula for a binomial variable, the variance of the estimated probability p is p * (1 - p) / n. Here, if we believe that p_hat <- s / (s + f) is close to the real p (assumption b) and use it to replace p, we get the standard error of the estimated p. Following the CLT and the delta method, when the sample size is large enough we can treat s / (s + f) or logit(s / (s + f)) as approximately normally distributed: s / (s + f) is approximately N(p, s * f / (s + f)^3) and logit(s / (s + f)) is approximately N(logit(p), dlogit(s / (s + f))^2 * s * f / (s + f)^3).
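If you want to convince yourself of the delta-method approximation, a small simulation works; the n and p below are arbitrary illustration values, not taken from your data:
## simulate many Binomial(n, p) counts and compare the empirical spread of
## logit(p_hat) with the delta-method standard deviation
set.seed(42)
n <- 200; p <- 0.3                   # illustration values only
sims <- rbinom(10000, size = n, prob = p)
p_hat_sim <- sims / n
sd(logit(p_hat_sim))                 # empirical sd of logit(p_hat)
dlogit(p) * sqrt(p * (1 - p) / n)    # delta-method approximation
The two numbers come out close; they drift apart as n shrinks or p approaches 0 or 1, which is exactly the situation in your small-sample and zero-count cases.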
Simply speaking, the standard errors, z-scores and p-values that glm calculates are just the standard errors, z-scores and p-values of logit(s / (s + f)). These are valid results for the null hypothesis logit(p) = 0, in other words p = 0.5. So the z-scores and p-values obtained from glm test whether or not success and failure happen with equal probability when the sample size s + f is large.
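You can reproduce glm's z value by hand to see that it really is this Wald test of logit(p) = 0; for example, for case i = 5 (s = 2, f = 50):
## glm's z value for case i = 5 is the Wald statistic for H0: logit(p) = 0
m5 <- glm(cbind(2, 50) ~ 1, family = 'binomial')
summary(m5)$coefficients    # Estimate, Std. Error, z value, Pr(>|z|)
logit(2 / 52) / error1[5]   # the same z value, computed by hand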
Now for the extreme values caused by zeros. When s or f equals 0, the estimated probability that a failure or a success happens will be 1; if this were true, the data-generating mechanism would actually be non-random! At the beginning I said that we use our estimates to approximately calculate the randomness in our estimates, and in the case where s or f equals 0, using our estimates as the ground truth means believing them with 100% certainty, which is kind of ridiculous. In such cases a lot of methods, glm included, will not be valid. Generally speaking, if the sample size s + f is big enough, we can believe that the probability of a success (or a failure) is really small when s = 0 (or f = 0), but if the sample size is really small, as in cases 6 and 7, we cannot reach any conclusion.
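For the zero-count cases that do have a reasonable sample size, one option among several standard remedies is to drop the Wald machinery and use an exact Clopper-Pearson interval, which binom.test provides; for case i = 3 (s = 0, f = 20):
## glm's Wald standard error blows up when s = 0 ...
summary(glm(cbind(0, 20) ~ 1, family = 'binomial'))$coefficients
## ... but an exact interval still bounds p sensibly (roughly [0, 0.17])
binom.test(x = 0, n = 20)$conf.int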
In sum, if the binomial model is true, then from the glm results, my code and the analysis provided above, we can say that in cases i = 2, 3, 4, 5 the probabilities of success and failure differ from each other significantly.