Proportion vs. binary response with pglm

I'm working with panel data comprising several years of observations of schools. My DV is a proportion of exam passers but is not normally distributed, and many observations of the DV are > 0.8. A panel linear model using plm() (from package plm) is therefore inappropriate, so I am trying to treat the DV as a binary response and use logistic regression with pglm() (from package pglm). I have counts of the numbers of test takers and passers.

I have determined that I need to use fixed effects (within-unit) estimation for these data as I'm interested in the average change in exam pass rates within schools. I have far too many observations to post the full dataset but here is a small reproducible example of the error message:

id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
year <- rep(c(2017, 2018, 2019), 4)
proportion <- c(.67, .77, .79, .88, .89, .85, .79, .81, .79, .87, .75, .74)
X1 <- c(.05, .041, .037, .015, .012, .021, .081, .055, .062, .034, .031, .022)
X2 <- c(145, 146, 145, 155, 154, 154, 150, 152, 156, 148, 150, 151)
takers <- c(50, 62, 55, 112, 101, 119, 44, 45, 48, 66, 69, 60)
passers <- c(34, 48, 43, 99, 90, 101, 35, 36, 38, 57, 52, 44)
fails <- takers - passers

data <- as.data.frame(cbind(id, year, proportion, X1, X2, takers, passers, fails))

pglm::pglm(cbind(passers, fails) ~ X1 + X2, index = c("id", "year"), model = "within",  family = binomial(link = "logit"), data = data)
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed

^{Created on 2020-10-21 by the reprex package (v0.3.0)}

I do not encounter an issue running a regular logit:

glm(cbind(passers, fails) ~ X1 + X2,family = binomial(link = "logit"), data = data)

And I am also familiar with an alternative to the treat-DV-as-binary approach, namely the betareg() package which uses beta regression]2, but I do not see a why to use fixed effects with betareg(). I can also run this code using glmer() and setting a random intercept (1|id), but a random effects approach does not make theoretical sense given my research question and a Hausman test indicates I need fixed effects anyway.

My interpretation of the error message is that row names are being duplicated somehow; I ensured this was not the case by setting all row names to NULL but this did not fix the issue:

row.names(data) <- NULL

I also referred to seemingly similar questions on this issue such as this but I have ensured that there are no duplications in the id-year pairing.

So, any help on figuring out what the error is would be much appreciated. Comments on methodology are also welcome, of course.

Solution

The error message about duplicate row names is a bit misleading as pglm cannot handle the specific input glm can handle with a two-column matrix specifying the proporion (cbind(passers, fails) in your code). glm is more flexible regards the various input possiblities, see ?glm.

pglm can only handle a binary dependent variable as input on the left-hand side of the formula. Thus, you want to bring down the data to the "individual level" (Here is some better discussion of the topic with individual outcome (binary response) and group outcome (proportion), using glm http://www.simonqueenborough.info/R/statistics/lessons/Binomial_Data.html; http://pages.stat.wisc.edu/~mchung/teaching/MIA/reading/GLM.logistic.Rpackage.pdf).

Here is some code that gives you the data transformation to replicate the model you estimated using glm with pglm. See how the total number of exam takers (passers and fails) is used to bring the data down to the individual outcome (binary response) from the group level (proportion).

# glm - your reference
summary(mod1 <- glm(cbind(passers, fails) ~ X1 + X2, family = binomial(link = "logit"), data = data))
# glm - same with weights
data$prop <- data$passers / data$takers
summary(mod2 <- glm(prop ~ X1 + X2, family = binomial(link = "logit"), data = data, weights = takers))

# construct data suitable for pglm
df2 <- df[rep(seq_along(data$takers), data$takers), ]
df2$ID <- paste(df2$id, unlist(lapply(df$takers, seq_len)), sep = '')
vec <- numeric()
for (i in 1:nrow(data)) {
    vec  <- c(vec, (c(rep(1, data$passers[i]), rep(0, data$fails[i]))))
}
df2$resp <- vec  
pdf2 <- pdata.frame(df2, index = "id")

# same with pglm
summary(mod3 <- pglm(resp ~ X1 + X2, family = binomial(link = "logit"), data = pdf2, model = "pooling"))

If you what to estimate any other but the "pooling" model, you would need to construct a different index (or otherwise you would get wrong results, I assume) -- for which you might not have the information (individual-time combinations for all lines in pdf2/df2).