I have thousands of files produced by the alignment step performed with STAR for an RNA-Seq analysis. Each file is a log ("*Log.final.out") that summarises the statistics for each lane (4 lanes per sample in total). Since I have to combine all the statistics into a single file, I need to extract the following information from each file, for each lane: Number of input reads, Uniquely mapped reads number, and Uniquely mapped reads %. Is there a way to extract all the information I need from each file without manually copying and pasting the values one by one?
Here is an example of what the log file looks like:
Started job on | Jul 17 18:34:39
Started mapping on | Jul 17 18:34:39
Finished on | Jul 17 18:35:44
Mapping speed, Million of reads per hour | 507.64
Number of input reads | 9165655
Average input read length | 76
UNIQUE READS:
Uniquely mapped reads number | 7953458
Uniquely mapped reads % | 86.77%
Average mapped length | 73.74
Number of splices: Total | 1924655
Number of splices: Annotated (sjdb) | 1892117
Number of splices: GT/AG | 1909019
Number of splices: GC/AG | 6636
Number of splices: AT/AC | 1016
Number of splices: Non-canonical | 7984
Mismatch rate per base, % | 0.43%
Deletion rate per base | 0.01%
Deletion average length | 1.40
Insertion rate per base | 0.01%
Insertion average length | 1.30
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1179823
% of reads mapped to multiple loci | 12.87%
Number of reads mapped to too many loci | 9207
% of reads mapped to too many loci | 0.10%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 0.22%
% of reads unmapped: other | 0.04%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Try this:
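library(tidyverse)
path <- "<PATH TO *.out FILES>"  # directory containing the STAR logs
# The pattern is a regular expression; full.names = TRUE returns complete paths
files <- list.files(path, pattern = "Log\\.final\\.out$", full.names = TRUE)
# Read one log (tab-separated after the "|") and keep the three statistics of interest
merge_out <- function(file) {
  read.delim(file, header = FALSE, strip.white = TRUE) %>%
    # "Uniquely mapped reads" matches both the "number" and the "%" line
    filter(grepl("Number of input reads|Uniquely mapped reads", V1)) %>%
    set_names("Var", "value")
}
# One data frame per log, named by file so each lane stays identifiable
results <- lapply(files, merge_out)
names(results) <- basename(files)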
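Since the goal is a single file with all the statistics, the per-file results still need to be stacked into one table. Here is a minimal sketch of one way to do that in base R, so it needs no extra packages. The helper names `read_star_log` and `combine_star_logs`, the output column names, and the CSV file name in the usage note are my own choices for illustration, and the sketch assumes every log contains the three lines shown in the example above.

```r
# Hypothetical helper: pull the three statistics out of one STAR
# Log.final.out file and return them as a one-row data frame.
read_star_log <- function(file) {
  lines <- readLines(file)
  # Keep only "key | value" lines (section headers have no "|")
  lines <- lines[grepl("|", lines, fixed = TRUE)]
  # Split each line at the first "|" and trim the surrounding whitespace
  key   <- trimws(sub("\\|.*$", "", lines))
  value <- trimws(sub("^[^|]*\\|", "", lines))
  stats <- setNames(value, key)
  data.frame(
    file                = basename(file),
    input_reads         = as.numeric(stats[["Number of input reads"]]),
    uniquely_mapped     = as.numeric(stats[["Uniquely mapped reads number"]]),
    # Drop the trailing "%" so the percentage is stored as a number
    uniquely_mapped_pct = as.numeric(sub("%", "", stats[["Uniquely mapped reads %"]],
                                         fixed = TRUE))
  )
}

# Hypothetical helper: apply read_star_log() to every log and stack the rows.
combine_star_logs <- function(files) {
  do.call(rbind, lapply(files, read_star_log))
}
```

To use it, list the logs with full paths, e.g. `logs <- list.files(path, pattern = "Log\\.final\\.out$", full.names = TRUE)`, then run `summary_table <- combine_star_logs(logs)` and save the table with `write.csv(summary_table, "star_mapping_summary.csv", row.names = FALSE)`. Each row is one log file, so if the lane is encoded in the file name it can be recovered from the `file` column.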
Let me know if that helped.