I did not find anything similar to my issue.
I have single cell expression data from 20 different experiments (29733genes x 24489 cells
). We want to see if there are differences in gene expression between levels in a factor cell_types
(4 levels) by each level of another variable that has 3 levels (for discretion lets call it factor2
).
The objective is to obtain differentially expressed genes when testing these contrasts:
contrast_matrix <- makeContrasts(
"Cell_typeAvsOther_Factor2:level1" = Cell_typeA_Factor2_level1 - (Cell_typeB_Factor2_level1 + Cell_typeC_Factor2_level1 + Cell_typeD_Factor2_level1) / 3,
"Cell_typeAvsOther_Factor2:level2" = Cell_typeA_Factor2_level2 - (Cell_typeB_Factor2_level2 + Cell_typeC_Factor2_level2 + Cell_typeD_Factor2_level2) / 3,
"Cell_typeAvsOther_Factor2:level3" = Cell_typeA_Factor2_level3 - (Cell_typeB_Factor2_level3 + Cell_typeC_Factor2_level3 + Cell_typeD_Factor2_level3) / 3,
"Cell_typeBvsOther_Factor2:level1" = Cell_typeB_Factor2_level1 - (Cell_typeA_Factor2_level1 + Cell_typeC_Factor2_level1 + Cell_typeD_Factor2_level1) / 3,
"Cell_typeBvsOther_Factor2:level2" = Cell_typeB_Factor2_level2 - (Cell_typeA_Factor2_level2 + Cell_typeC_Factor2_level2 + Cell_typeD_Factor2_level2) / 3,
"Cell_typeBvsOther_Factor2:level3" = Cell_typeB_Factor2_level3 - (Cell_typeA_Factor2_level3 + Cell_typeC_Factor2_level3 + Cell_typeD_Factor2_level3) / 3,
"Cell_typeCvsOther_Factor2:level1" = Cell_typeC_Factor2_level1 - (Cell_typeB_Factor2_level1 + Cell_typeA_Factor2_level1 + Cell_typeD_Factor2_level1) / 3,
"Cell_typeCvsOther_Factor2:level2" = Cell_typeC_Factor2_level2 - (Cell_typeB_Factor2_level2 + Cell_typeA_Factor2_level2 + Cell_typeD_Factor2_level2) / 3,
"Cell_typeCvsOther_Factor2:level3" = Cell_typeC_Factor2_level3 - (Cell_typeB_Factor2_level3 + Cell_typeA_Factor2_level3 + Cell_typeD_Factor2_level3) / 3,
"Cell_typeDvsOther_Factor2:level1" = Cell_typeD_Factor2_level1 - (Cell_typeB_Factor2_level1 + Cell_typeA_Factor2_level1 + Cell_typeC_Factor2_level1) / 3,
"Cell_typeDvsOther_Factor2:level2" = Cell_typeD_Factor2_level2 - (Cell_typeB_Factor2_level2 + Cell_typeA_Factor2_level2 + Cell_typeC_Factor2_level2) / 3,
"Cell_typeDvsOther_Factor2:level3" = Cell_typeD_Factor2_level3 - (Cell_typeB_Factor2_level3 + Cell_typeA_Factor2_level3 + Cell_typeC_Factor2_level3) / 3,
levels = design_matrix
)
To simplify my model I combined cell type and factor2 into a unified factor.
Then we have batch factor. Here I show you the distribution of cell types across batches. As you can see some batches are celltype specific:
celltypeA celltypeB celltypeC celltypeD
batch_1 0 0 441 0
batch_10 0 63 3 58
batch_11 181 13 7 11
batch_12 2 36 2 1144
batch_13 0 16 2 876
batch_14 58 265 3226 469
batch_15 19 66 115 858
batch_16 0 73 100 1996
batch_17 0 29 783 1
batch_18 2 89 192 884
batch_19 1152 459 33 71
batch_2 193 3 0 0
batch_20 0 2219 2 2
batch_3 198 63 0 0
batch_4 8 3 3 208
batch_5 8 12 499 2
batch_6 1878 465 8 1
batch_7 3642 72 293 46
batch_8 0 206 0 1
batch_9 602 52 2 3
Using a fix effects model with a design matrix such as :
design_matrix <- model.matrix(
~ 0 + Celltype_factor2 + Batch ,
data = cellinfo,
contrasts.arg = lapply(cellinfo[, sapply(cellinfo, is.factor), drop = FALSE],
contrasts, contrasts = FALSE))
Was giving me a design matrix that was not full rank so it was making me remove one of the coefficients, which I feel is not appropiate.
Additionally, I kept reading many questions and threads about this and I arrived to the conclusion that batch should be a random effect. Mainly because I do not want to just analyze the differences in expression for these specific batches. I tried to implement the following mixed-effects model using lme4
package(1.1-34)in R (R version 4.3.1 (2023-06-16 ucrt)
).
model <- lmer(gene_expression ~ Celltype_factor2 + (1 | Batch), data=cellinfo)
But I am getting the following error (note: I ran it in a subset of my data. I tried subsetting only cells, subsetting both genes and cells. I checked for NAs, all the levels were also included, etc...):
Error in !is.na(v.e) && v.e > 0 :
'length = 90000' in coercion to 'logical(1)'
I did not find anybody having this issue. I would appreciate some insight, either model-wise or error-wise.
This really ought to have a more informative error message (it does now in the development version), but the problem is that (unlike lm()
), lmer
can't handle matrix-valued response variables. Example:
dd <- data.frame(x = rnorm(1000), batch = factor(rep(1:20, each=50)))
dd$y <- matrix(rnorm(1e4), ncol = 10)
lmer(y ~ x + (1|batch), dd)
Error in !is.na(v.e) && v.e > 0 : 'length = 100' in coercion to 'logical(1)'
The proper way to handle this case is by using refit
:
ans <- vector("list", length = ncol(dd$y))
ans[[1]] <- lmer(y[,1] ~ x + (1|batch), dd)
for (i in 2:ncol(dd$y)) {
ans[[i]] <- refit(ans[[1]], newresp = dd$y[,i])
}
(The example in ?refit
uses lapply()
, but I think this might be clearer ...)