I am analysing with R some gene expression data. I would like to do differential gene expression analysis with limma's eBayes (limma is part of BioConductor), but to do that I need to have my expression data as an eset object. Thing is, I have only preprocessed data and do not have the CEL files, I could convert directly to eset object. I tried searching from Internet, but couldn't find a solution. Only thing I found, was that it IS possible.
Why eBayes: It should have robust results even with only two or three samples in some of the groups and I do indeed have 3 groups that are from 2 to 3 samples in size.
In detail what I have and want to do: I have expression data, already as logarithmic, normalized intesity values. The data is in expression matrix. There is about 20 000 rows and each row is a gene and the rownames are the official gene names. There is 22 columns and each column corresponds to one cancer sample. I have different kinds of cancer subtypes there and would like to compare for example subtype 1 samples' gene expression to that of the group 2's. Below is a two row, 5 column example of what my matrix would look like.
Example matrix:
SAMP1 SAMP2 SAMP3 SAMP4 SAMP5
GENE1 123.764 122.476 23.4764 2.24343 123.3124
GENE2 224.233 455.111 124.122 112.155 800.4516
The problem: To evaluate the differential gene expression with eBayes I would need the eset object out of this expression data and I have honestly no idea how to go about that step. :(
I am very grateful for every bit of info that can help me out! If someone can suggest another reliable method for small sample size comparisons, that might solve my problem as well.
Thank you!
Using an ExpressionSet
seems to be quite similar to a SummarizedExperiment
which is also prevalent in Bioconductor packages. From what I understand, there is nothing special about using one or the other in a package--in my experience, it's considered as a generalized container for data in order to standardize the data set format across Bioconductor packages.
From the vignette on Bioconductor:
Affymetrix data will usually be normalized using the affy package. We will assume here that the data is available as an ExpressionSet object called eset. Such an object will have an slot containing the log-expression values for each gene on each array which can be extracted using exprs(eset).
In other words, there's nothing special about the data for the ExpressionSet. An ExpressionSet
is simply a bunch of related experimental data strung together into one, but it appears that I can create a new object just from the regular object:
library(limma)
# counts is the assay data I already have.
dim(counts)
# [1] 64102 8
# Creates a new ExpressionSet object (quite bare, only the assay data)
asdf <- ExpressionSet(assayData = counts)
# Returns the data you put in.
exprs(asdf)
This works on my setup.
The second part that you need to consider is the design of the differential expression analysis comparison model matrix. You will need predefined factors to go along with your samples (probably within a phenoData
argument to ExpressionSet
and then create a model.matrix
using R's special formula
syntax. They look similar to: dependent ~ factor1 + factor2 + co:related
. Note that a factor1
is a factor category or dimension, not just one level.
Once you have that, you should be able to run lmFit
. I've actually not used limma
much before but it appears to be similar to edgeR
's scheme.