Search code examples
rloopsrna-seq

Extracting read counts matched to each gene symbol


I have quantified gene expression by Salmon that gives me Ensembl transcripts, I converted Ensembl transcripts to gene symbol but for some genes I multiple transcripts; How I could collapse read counts to genes, I tried tximport package but I found that too hard as my annotation is different.

Name                NumReads
ENST00000355520.5   407.186
ENST00000566753.1   268.879
ENST00000481617.2   242.25
ENST00000538183.2   226.576

ensembltranscript_id gene_name
ENST00000482226.2   FCGR2C
ENST00000508651.1   FCGR2C
ENST00000571914.1   TSPAN10
ENST00000571707.1   TSPAN10
ENST00000534817.1   OVCH2
ENST00000445557.1   OR52E1
ENST00000575319.1   CYP2D7
ENST00000576465.1   CYP2D7

EDITED

This is output of Salmon read counts

https://www.dropbox.com/s/7bkril0v6sw7v9z/Salmon_output.txt?dl=0

And this is when I converted transcript ids in Salmon output to gene name

https://www.dropbox.com/s/m1iybfbu2i4bb39/Converting_transcript_id_to_gene_id.txt?dl=0


Solution

  • You can use the package dplyr.

    Create test table:

    names = c("ensembltranscript_id",  "gene_name", "NumReads")
    transcripts = c("ENST00000482226.2", "ENST00000508651.1", "ENST00000571914.1", "ENST00000571707.1", "ENST00000534817.1")
    gene_names = c("FCGR2C", "FCGR2C", "TSPAN10", "TSPAN10", "OVCH2")
    reads = c(205.56, 456.21, 123.3, 52.6, 268.45)
    
    data = data.frame(transcripts, gene_names, reads)
    names(data) = names
    

    Do the calculation:

    result = data %>% 
      group_by(gene_name) %>% 
      summarise(sum(NumReads)) %>%
      mutate_if(is.numeric, format, 2)
    

    Print the result:

    # A tibble: 3 x 2
      gene_name `sum(NumReads)`
      <fct>     <chr>          
    1 FCGR2C    661.77         
    2 OVCH2     268.45         
    3 TSPAN10   175.90
    

    Hope this helps.

    Edit:

    As stated in the comments of the OP, an expected output would help. Sorry, maybe I misunderstood 'collapse' in this context. My interpretation is in adding up the reads per gene-name.

    Edit2:

    As mentioned in my comment, try to prevent to provide links. Links can be broken etc. For full instructions on how to write a good post see: here.

    However, based on your real data do the following:

    Load the data:

    salmon_reads = read.table(file = "/path/to/Salmon_output.txt", header = T, sep = "\t")
    genes = read.table(file = "/path/to/Converting_transcript_id_to_gene_id.txt", header = T, sep = "\t")
    

    Simply merge the data by there transcript-id:

    merged_data = merge(x = salmon_reads, y = genes, by.x = colnames(salmon_reads)[1], by.y = colnames(genes)[1], all = T)
    

    Do the calculation and order for decreasing reads:

    result = merged_data %>% 
      group_by(external_gene_name) %>% 
      summarise(sum(NumReads)) %>%
      mutate_if(is.numeric, format, 2)
    
    result$`sum(NumReads)` = as.numeric(result$`sum(NumReads)`)
    result = result[order(result$`sum(NumReads)`, decreasing = T),]
    

    You did not mention how to handle NAs. In this scenario all reads for gene-names which are NA are summed up. This is why NA has the most reads.