Search code examples
rbioinformaticsrna-seq

Questions about DESeq2 and setting up logical vectors to select a result file


I am very new to use DESeq2 and R language for processing sequencing data. I have R and R-studio installed successfully, together with all necessary packages. But when I followed some tutorial to understand the design and contrast, I found myself lost. Here is the link to the tutorial: https://www.r-bloggers.com/2024/05/a-guide-to-designs-and-contrasts-in-deseq2/

My specific question is -- following the instruction, I got a result file called res.

res <- results(dds1, contrast = list("conditiontreatment", "conditioncontrol"))

Here I get res with a dimension of (1000,6)

I want to select the genes that have a p-value < 0.05 in the result file (res), so I set up the following logical vector: significant_p <- res$pvalue < 0.05 when I tried to use this logical vector to select the res file, it said that there are na values in res, so I dropped all the na values by doing the following: filtered_res <- na.omit(res)

Now I can use the logical vector on the filtered_res to select for the genes' expression with a pvalue < 0.05. However, the system returned the following error:

filtered_res[significant_p]
Error: subscript is a logical vector with out-of-bounds TRUE values

But when I used the logical vector with dds1, then it worked.

dds1[significant_p]
class: DESeqDataSet 
dim: 465 6 
metadata(1): version
assays(4): counts mu H cooks
rownames(465): gene1 gene2 ... gene996 gene997
rowData names(25): trueIntercept trueBeta ... deviance maxCooks
colnames(6): sample1 sample2 ... sample5 sample6
colData names(2): condition sizeFactor

I don't quite understand the datastructure here very well. Obviously I hope that the logical vector would work on the res file, instead of the original file dds1. Can anyone help explain a little to me?

Thank you very much!


Solution

  • One potential issue is:

    significant_p <- res$pvalue < 0.05
    

    This is a logical evaluation, 'is the pvalue < 0.05?', and it will return TRUE / FALSE / NA for each row.

    Instead, try:

    significant_p <- res[res$pvalue < 0.05 & !is.na(res$pvalue),]
    

    E.g. with a minimal reproducible example:

    res <- data.frame(pvalue = c(0.01, 0.02, NA, 0.07),
                      values = c(4, 5, 6, 7),
                      gene = c("gene_A", "gene_B", "gene_C", "gene_D"))
    
    res
    #>   pvalue values   gene
    #> 1   0.01      4 gene_A
    #> 2   0.02      5 gene_B
    #> 3     NA      6 gene_C
    #> 4   0.07      7 gene_D
    
    # result for your existing code
    res$pvalue < 0.05
    #> [1]  TRUE  TRUE    NA FALSE
    
    # a different approach to select all rows that are < 0.05 and not NA
    res[res$pvalue < 0.05 & !is.na(res$pvalue),]
    #>   pvalue values   gene
    #> 1   0.01      4 gene_A
    #> 2   0.02      5 gene_B
    
    # or, if you only want the p-values that are < 0.05
    res[res$pvalue < 0.05 & !is.na(res$pvalue),]$pvalue
    #> [1] 0.01 0.02
    
    # or, if you only want the gene names for rows that have p-value < 0.05
    res[res$pvalue < 0.05 & !is.na(res$pvalue),]$gene
    #> [1] "gene_A" "gene_B"
    

    Created on 2024-09-19 with reprex v2.1.0

    Does that help?