
convert multiple SEQ files to fasta format


Is there a way to convert hundreds of SEQ files to FASTA format?

The SEQ files contain only the sequence in text format.

ATGCGATCGGACTGACTAGCTACGTACG
ACATCCATCATTATTCTATCTATCTATC
ACTATTCATCTATCTTACTATCTTACTC
AATCATTTCATTA

How can I append the file name of each individual text file as the string ID?

I tried applying code from this thread, like this:

files1 <- list.files(pattern = "*.seq")   
files1 
head(files1) 
for (i in 1:length(files1)) {   
  logFile = read.table(paste0(files1[i]))      
  write.table(rbind(paste0(">",files1[i]),logFile),paste0(files1[i],".fa"),row.names = FALSE,col.names = FALSE,quote = FALSE) 
}

But it did not work; the only output was a +


Solution

  • I had to do this for a .seq file, which is generated by Lasergene (from DNASTAR, I think).

    The format of the .seq is:

    "Contig 2" (1,1412)
      Contig Length:                 1412 bases
      Average Length/Sequence:        757 bases
      Total Sequence Length:         4544 bases
      Top Strand:                       4 sequences
      Bottom Strand:                    2 sequences
      Total:                            6 sequences
    FEATURES             Location/Qualifiers
         contig          1..1412
                         /Note="Contig 2(1>1412)"
                         /dnas_scaffold_ID=0
                         /dnas_scaffold_POS=0
         coverage_below  1..568
                         /Note="Below threshold"
         coverage_one    569..749
                         /Note="One_strand"
         coverage_below  750..1331
                         /Note="Below threshold"
         coverage_one    1332..1412
                         /Note="One_strand"
    
    ^^
    ATGC
    

    The sequence data always followed the ^^ line, so I wrote this simple function to read in the .seq file (plain text) and write out a FASTA file with the file name as the header.

    convert_seq_to_fasta = function(path){
      
      # read in file
      lines = readLines(path)
      # find the ^^ marker line - the sequence data is on the next line
      start = which(trimws(lines) == "^^")[1] + 1
      
      # get name and create output name (strip only a trailing ".seq")
      file_name = sub("\\.seq$", "", path)
      output = paste0(file_name, ".fasta")
      
      # create fasta header and store fasta body
      # (assumes the whole sequence sits on that one line, as in the example above)
      fasta_header = paste0(">", file_name)
      fasta_body = lines[start]
      
      # write out header and sequence, each on its own line
      writeLines(c(fasta_header, fasta_body), output)
    }
    

    Use it like this:

    seq_files = list.files(pattern = "\\.seq$")
    
    for (files in seq_files) {
      convert_seq_to_fasta(files)
    }
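
    One caveat: convert_seq_to_fasta only keeps the single line right after ^^. If the sequence in your files wraps over several lines, a variant along these lines might help. It is only a sketch: extract_after_marker is a made-up name, and it assumes that nothing but sequence data follows the marker.

    ```r
    # Hypothetical helper: return the sequence that follows the "^^" marker line,
    # joining wrapped lines. Assumes only sequence data comes after the marker.
    extract_after_marker = function(lines) {
      marker = which(trimws(lines) == "^^")
      if (length(marker) == 0) stop("no ^^ marker line found")
      # keep everything after the first marker line
      seq_lines = lines[-seq_len(marker[1])]
      # drop any whitespace and join into one string
      gsub("\\s", "", paste(seq_lines, collapse = ""))
    }

    # small made-up example: header text, marker, then a wrapped sequence
    example_lines = c('"Contig 2" (1,1412)', "^^", "ATGC", "GGTA")
    print(extract_after_marker(example_lines))
    ```

    The result could then be used as fasta_body in place of lines[start].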
    

    The loop above assumes the .seq files are in R's current working directory, for example the folder the script is saved in and run from (so save it first).
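
    If the files live in another folder, one option is to switch into that folder for the duration of the loop. The following is a rough sketch rather than anything definitive: convert_in_dir and its arguments dir and converter are names invented for this example, and converter is expected to be a function that takes a single file name, such as the one defined above.

    ```r
    # Hypothetical wrapper: run a converter on every .seq file inside `dir`.
    # `converter` is any function that takes one file name.
    convert_in_dir = function(dir, converter) {
      # work inside `dir`, and restore the previous working directory on exit
      old = setwd(dir)
      on.exit(setwd(old))
      # bare file names, so the FASTA header does not include the folder path
      seq_files = list.files(pattern = "\\.seq$")
      for (f in seq_files) {
        converter(f)
      }
      # return the names of the files that were processed
      invisible(seq_files)
    }
    ```

    It could be called as convert_in_dir("path/to/folder", convert_seq_to_fasta), where the folder path is a placeholder. Changing into the folder, instead of passing full paths, keeps the header as just the file name.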

    If your .seq files have this format, assuming the file name is rando.seq:

    ATGCGATCGGACTGACTAGCTACGTACG
    ACATCCATCATTATTCTATCTATCTATC
    ACTATTCATCTATCTTACTATCTTACTC
    AATCATTTCATTA
    

    And you want this output:

    >rando
    ATGCGATCGGACTGACTAGCTACGTACGACATCCATCATTATTCTATCTATCTATCACTATTCATCTATCTTACTATCTTACTCAATCATTTCATTA
    

    That is, a header line followed by all the sequence data on one line. Then you can use this function:

    convert_odd_to_fasta = function(path){
      lines = readLines(path)
      # strip only a trailing ".seq" from the file name
      file_name = sub("\\.seq$", "", path)
      output = paste0(file_name, ".fasta")
      fasta_header = paste0(">", file_name)
      # join all lines into one sequence string
      fasta_body = paste0(lines, collapse = '')
      # write out header and sequence, each on its own line
      writeLines(c(fasta_header, fasta_body), output)
    }
    

    Use it the same as above.
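
    As a quick sanity check, here is a minimal sketch that recreates the rando.seq example in a temporary folder and runs the same steps on it. The helper name seq_to_fasta_lines is made up for this example; it returns the two output lines instead of writing them to a file, so the result is easy to inspect.

    ```r
    # Hypothetical helper (not one of the functions above): build the two FASTA
    # lines for a plain-sequence .seq file without writing anything to disk.
    seq_to_fasta_lines = function(path) {
      lines = readLines(path)
      # header is the file name without its folder or the trailing ".seq"
      header = paste0(">", sub("\\.seq$", "", basename(path)))
      # join all sequence lines into a single string
      body = paste0(lines, collapse = "")
      c(header, body)
    }

    # recreate the example file in a temporary folder
    example_path = file.path(tempdir(), "rando.seq")
    writeLines(c("ATGCGATCGGACTGACTAGCTACGTACG",
                 "ACATCCATCATTATTCTATCTATCTATC",
                 "ACTATTCATCTATCTTACTATCTTACTC",
                 "AATCATTTCATTA"), example_path)

    fasta = seq_to_fasta_lines(example_path)
    print(fasta[1])         # the header line
    print(nchar(fasta[2]))  # total length of the joined sequence
    ```

    The joined sequence should be as long as the four input lines added together.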

    Hope that helps!