Search code examples
rbioinformaticsfasta

How to combine FASTA sequence according to sequence id?


I have 9 FASTA files, representing the DNA sequencing of 9 genes.

Each FASTA file contains 121 sequences ,representing 121 strains. The name for each sequence is the id for each strain.

However, in each file, the id is not sorted, for example, in gene1.fasta:

>1
AAA
>16
TTT
>2
GGG
...

In gene2.fasta:

>2
CCC
>34
AAA
>1
GGG
...

I want to change these 9 genes FASTA files into 121 strains FASTA files, in each file, simply combine 9 genes for one strain. For example, in strain1.fasta:

AAAGGG

in strain2.fasta:

GGGCCC

How can I do this in R?


Solution

  • Here is a solution in R as requested, using the Biostrings package to read fasta files.

    It works, but I have to say this is some of the ugliest code I have written in a long time. I just wanted to see if I could get this somehow done - this is 100% not the best solution.

    library("Biostrings")
    library("tidyverse")
    
    convertStringSet = function(seq){
      return(df = data.frame("names" = names(seq), "seq" = paste(seq)))
    }
    
    # change the path accordingly
    filenames = list.files("/home/x/y/z", pattern="gene*", full.names=TRUE)%>%
      lapply(readDNAStringSet)
    
    fastaDF = filenames %>% lapply(convertStringSet) %>% 
      reduce(full_join, by = "names") %>% 
      unite("seq", -1,  sep="")
    
    writeOutput = function(x){
    
      header = paste(">",x[1],sep="")
      fileName = paste("strain",x[1],".fasta",sep="")
    
      writeLines(c(header,x[2]), fileName)
    }
    
    apply(fastaDF, 1, writeOutput)
    

    As an alternative, if you are on a unix system, this awk line should give the same results:

    awk '$0 ~ /^>/ {i=substr($0,2); next;} i != -1 {printf "%s", $0 >> "file"i; i=-1;}' gene*