Loop on files - Parse Files and group them by identifier


I would like to:

  1. Read the list of *.bed files from a directory.
  2. For each .bed file, read the id=NAME value stored in the fifth column of its rows (e.g., HOX.bed and zinc.bed below).
  3. Determine which family the file belongs to (e.g., cram-2) using a separate lookup table that links id values to a Family value (see the lookup table below).
  4. Concatenate all files that belong to the same family (e.g., HOX.bed and zinc.bed) into one .bed file.
  5. Save the combined file under the name given in the Family column (e.g., cram-2.bed).

Example:

Rows of the HOX.bed file:

ma  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

Rows of the zinc.bed file:

ma  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

The lookup table:

Name                        Family
HOX                         cram-2
zinc                        cram-2
fire                        sf.xr
fire                        ra.XS-2
...continues...

The output I am trying to obtain:

File name = cram-2.bed

Concatenate HOX.bed and zinc.bed because both are from Family cram-2!

ma  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=HOX;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
ma  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
se  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
to  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05
pa  reg out fim id=zinc;seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05

I started to prepare a script, but I am struggling with how to make all the files with the same Family end up in the same output file (ideally a .bed file):

myFiles <- list.files(pattern = "\\.bed$") 
for(i in myFiles){
  name <- read.table((i), header = FALSE, sep="\t", stringsAsFactors=FALSE, quote="")
  name <- name %>% top_n(1, "id")
  Family_filtering <-
    table %>% filter(
      Family %in% name)
  save(...????????...)
}

Thank you a lot for the help!!!


Solution

  • Write each step as its own function, then combine the functions in one pipeline:

    library(fs)
    library(tidyverse)
    
    dfNameFamily = tibble(
      Name = c("HOX", "zinc", "fire", "fire2"),
      Family = c("cram-2", "cram-2", "sf.xr", "ra.XS-2"))
    
    dir = "bedfile"
    
    BedFile = function(dir) dir_ls(dir, regexp = "\\.bed$")
    
    readTxt = function(FileName){
      lines = character()
      if(file_exists(FileName)){
        con = file(FileName, open = "r")
        lines = readLines(con)
        close(con)
      }
      lines
    }
    
    GetName = function(l) str_match(l, "id=(.+);seq")[1,2]
    
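    SaveFile = function(l, name, dir){
      # group_walk() passes the group's rows as l and its key (Family) as a one-row tibble, name
      con = file(file.path(dir, paste0(name$Family, ".bed")))  # e.g. bedfile/cram-2.bed
      writeLines(unlist(l$lines), con)
      close(con)
    }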
    
    tibble(FileName = BedFile(dir)) %>%  # List all .bed files in dir
      mutate(
        lines = map(FileName, readTxt),  # Read all lines of each .bed file
        Name = map_chr(lines, GetName)) %>%  # Get the Name (id) from the first row of each file
      left_join(dfNameFamily, by="Name") %>%  # Look up the Family for each Name
      group_by(Family) %>%
      group_walk(SaveFile, dir)  # Write one combined file per Family
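
To check the grouping logic without installing any packages, here is a minimal base-R sketch of the same steps, run on the sample rows from the question. The directory names `in_dir` and `out_dir`, the helper `make_rows`, and the trimmed two-row lookup table are assumptions made for this example. It also assumes that every row of a file carries the same id, so only the first row is inspected.

```r
# Hypothetical scratch folders: one for the inputs, one for the merged outputs
in_dir  <- file.path(tempdir(), "bedfile")
out_dir <- file.path(tempdir(), "family")
dir.create(in_dir,  showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Recreate the four sample rows from the question for a given id
make_rows <- function(id) {
  paste0(c("ma", "se", "to", "pa"), "\treg\tout\tfim\tid=", id,
         ";seq=AGCAGGAAATA;score=12.1915;pval=4.97e-05")
}
writeLines(make_rows("HOX"),  file.path(in_dir, "HOX.bed"))
writeLines(make_rows("zinc"), file.path(in_dir, "zinc.bed"))

# Trimmed lookup table (assumption: only the two names used here)
lookup <- data.frame(
  Name   = c("HOX", "zinc"),
  Family = c("cram-2", "cram-2"),
  stringsAsFactors = FALSE)

# Step 1: list the .bed files and read each one as a vector of lines
files <- list.files(in_dir, pattern = "\\.bed$", full.names = TRUE)
lines_by_file <- lapply(files, readLines)

# Step 2: take the id=NAME value from the first row of each file
ids <- vapply(lines_by_file,
              function(x) sub(".*id=([^;]+);.*", "\\1", x[1]),
              character(1))

# Step 3: map each id to its Family (first match in the lookup table)
families <- lookup$Family[match(ids, lookup$Name)]

# Steps 4 and 5: concatenate the files of each Family and write <Family>.bed
merged <- lapply(split(lines_by_file, families), unlist)
for (fam in names(merged)) {
  writeLines(merged[[fam]], file.path(out_dir, paste0(fam, ".bed")))
}

result <- readLines(file.path(out_dir, "cram-2.bed"))
```

Writing the merged files to a separate folder keeps them from being picked up as inputs on a second run. Note that `match()` uses only the first matching Name, so a name listed under two families (like fire in the question's lookup table) would land only in the first one, and files whose id is missing from the table are skipped by `split()`.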