Search code examples
rexpressionbioinformaticsbayesianbioconductor

Creating eset object from preprocessed expression matrix?


I am analysing with R some gene expression data. I would like to do differential gene expression analysis with limma's eBayes (limma is part of BioConductor), but to do that I need to have my expression data as an eset object. Thing is, I have only preprocessed data and do not have the CEL files, I could convert directly to eset object. I tried searching from Internet, but couldn't find a solution. Only thing I found, was that it IS possible.

Why eBayes: It should have robust results even with only two or three samples in some of the groups and I do indeed have 3 groups that are from 2 to 3 samples in size.

In detail what I have and want to do: I have expression data, already as logarithmic, normalized intesity values. The data is in expression matrix. There is about 20 000 rows and each row is a gene and the rownames are the official gene names. There is 22 columns and each column corresponds to one cancer sample. I have different kinds of cancer subtypes there and would like to compare for example subtype 1 samples' gene expression to that of the group 2's. Below is a two row, 5 column example of what my matrix would look like.

Example matrix:

       SAMP1   SAMP2   SAMP3   SAMP4   SAMP5
GENE1  123.764 122.476 23.4764 2.24343 123.3124
GENE2  224.233 455.111 124.122 112.155 800.4516

The problem: To evaluate the differential gene expression with eBayes I would need the eset object out of this expression data and I have honestly no idea how to go about that step. :(

I am very grateful for every bit of info that can help me out! If someone can suggest another reliable method for small sample size comparisons, that might solve my problem as well.

Thank you!


Solution

  • Using an ExpressionSet seems to be quite similar to a SummarizedExperiment which is also prevalent in Bioconductor packages. From what I understand, there is nothing special about using one or the other in a package--in my experience, it's considered as a generalized container for data in order to standardize the data set format across Bioconductor packages.

    From the vignette on Bioconductor:

    Affymetrix data will usually be normalized using the affy package. We will assume here that the data is available as an ExpressionSet object called eset. Such an object will have an slot containing the log-expression values for each gene on each array which can be extracted using exprs(eset).

    In other words, there's nothing special about the data for the ExpressionSet. An ExpressionSet is simply a bunch of related experimental data strung together into one, but it appears that I can create a new object just from the regular object:

    library(limma)
    
    # counts is the assay data I already have.
    dim(counts)
    # [1] 64102     8
    
    # Creates a new ExpressionSet object (quite bare, only the assay data)
    asdf <- ExpressionSet(assayData = counts)
    
    # Returns the data you put in.
    exprs(asdf)
    

    This works on my setup.

    The second part that you need to consider is the design of the differential expression analysis comparison model matrix. You will need predefined factors to go along with your samples (probably within a phenoData argument to ExpressionSet and then create a model.matrix using R's special formula syntax. They look similar to: dependent ~ factor1 + factor2 + co:related. Note that a factor1 is a factor category or dimension, not just one level.

    Once you have that, you should be able to run lmFit. I've actually not used limma much before but it appears to be similar to edgeR's scheme.