Tags: r, loops, unix, readr

Grep summary statistics from Log files


I have thousands of files produced by STAR alignments for an RNA-Seq analysis. Each file is a log ("*Log.final.out") that summarises the alignment statistics for one lane (4 lanes per sample in total). Since I have to combine all the statistics into a single file, I need to extract the following information from each file, for each lane: Number of input reads, Uniquely mapped reads number, and Uniquely mapped reads %. Is there a way to extract this information from every file without manually copying and pasting the values one by one?

Here is an example of what the log file looks like:

                             Started job on |   Jul 17 18:34:39
                         Started mapping on |   Jul 17 18:34:39
                                Finished on |   Jul 17 18:35:44
   Mapping speed, Million of reads per hour |   507.64

                      Number of input reads |   9165655
                  Average input read length |   76
                                UNIQUE READS:
               Uniquely mapped reads number |   7953458
                    Uniquely mapped reads % |   86.77%
                      Average mapped length |   73.74
                   Number of splices: Total |   1924655
        Number of splices: Annotated (sjdb) |   1892117
                   Number of splices: GT/AG |   1909019
                   Number of splices: GC/AG |   6636
                   Number of splices: AT/AC |   1016
           Number of splices: Non-canonical |   7984
                  Mismatch rate per base, % |   0.43%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.40
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.30
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1179823
         % of reads mapped to multiple loci |   12.87%
    Number of reads mapped to too many loci |   9207
         % of reads mapped to too many loci |   0.10%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   0.22%
                 % of reads unmapped: other |   0.04%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

Solution

  • Try this:

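    library(tidyverse)

    # Placeholder: replace with the folder that holds the logs
    path <- "<PATH TO *.out FILES>"
    # full.names = TRUE returns complete paths, so no paste0() is needed later
    files <- list.files(path, pattern = "Log\\.final\\.out$", full.names = TRUE)

    # Read one log and keep the three statistics of interest.
    # Assumes the value follows a tab after the "|", so read.delim()
    # splits each line into V1 (label) and V2 (value).
    merge_out <- function(file) {
      read.delim(file, header = FALSE, strip.white = TRUE) %>%
        # "Uniquely mapped reads" matches both the number and the % line
        filter(grepl("Number of input reads|Uniquely mapped reads", V1)) %>%
        set_names("Var", "value")
    }

    # One data frame per file, then a single table with the source file as a column
    results <- lapply(set_names(files, basename(files)), merge_out)
    combined <- bind_rows(results, .id = "file")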
    

    Let me know if that helped.
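
Since the goal is a single file with one row per log, here is a minimal base R sketch of that last step. It is only an illustration under stated assumptions: every statistic line is taken to have the form "label | value", and the function name `parse_star_log`, the column names, the demo file name `sample1_L001_Log.final.out`, and the output name `star_summary.csv` are all made up for this example. The demo writes a shortened copy of the log from the question to a temporary folder so the code can be run as is.

```r
# Parse one STAR Log.final.out file into a one-row data frame.
# Assumption: each statistic line looks like "<label> | <value>".
parse_star_log <- function(file) {
  lines <- readLines(file)
  # Section headers such as "UNIQUE READS:" contain no "|" and are dropped
  lines <- grep("|", lines, fixed = TRUE, value = TRUE)
  label <- trimws(sub("\\|.*$", "", lines))     # text before the "|"
  value <- trimws(sub("^[^|]*\\|", "", lines))  # text after the "|"
  stats <- setNames(value, label)
  data.frame(
    file = basename(file),
    input_reads = as.numeric(stats[["Number of input reads"]]),
    uniquely_mapped = as.numeric(stats[["Uniquely mapped reads number"]]),
    # Strip the trailing "%" so the percentage is stored as a number
    uniquely_mapped_pct = as.numeric(
      sub("%", "", stats[["Uniquely mapped reads %"]], fixed = TRUE)
    )
  )
}

# Demo input: a shortened log written to a temporary folder (hypothetical file name)
demo_dir <- file.path(tempdir(), "star_logs_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(
  "                          Number of input reads |\t9165655",
  "                                 UNIQUE READS:",
  "                   Uniquely mapped reads number |\t7953458",
  "                        Uniquely mapped reads % |\t86.77%"
), file.path(demo_dir, "sample1_L001_Log.final.out"))

# Parse every log in the folder and stack the rows into one table
logs <- list.files(demo_dir, pattern = "Log\\.final\\.out$", full.names = TRUE)
summary_table <- do.call(rbind, lapply(logs, parse_star_log))

# Write the combined statistics to a single CSV file
write.csv(summary_table, file.path(demo_dir, "star_summary.csv"), row.names = FALSE)
```

To use it on real data, point `list.files()` at your own folder instead of `demo_dir`. If the sample and lane are encoded in the file names, they can be split out of the `file` column afterwards.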