Search code examples
linuxbashawkfasta

Simple FASTA txt separation


I have 2 Fasta.txt files. I would like to make a Bash script that loops in both files and checks if it's the header or the sequence. The header starts with ">". So if it's a header I would like it to copy just the header into an output file. On the other hand if its not the header, but the sequence code, I would like to also copy it into an output in upper case.

So far I made a loop which looks like this:

#!/bin/bash

for f in seq/*.txt;
do
awk 'BEGIN {RS=">"} {print $1}' $f;
awk 'BEGIN {RS=">"} {print toupper ($2)}' $f;
done >> output.txt

I've tried making if/else but the Awk is not willing to work.

My desired output is that all the headers should be under each other, and then comes all the seqeuences in the same file.

How can I fix this?


Solution

  • It's probably easier if you forgo the RS assignment and just process a line at a time. Then you can also process FASTA files where the sequence is split across multiple lines, or where the header line includes whitespace.

    Awk can handle multiple input files, so there is no need for a separate for loop, here or in your original attempt.

    Your original script would print everything to the same output file; so this script does that, too (basically it only converts the sequences to upper case, and leaves the headers alone).

    # (Obsoleted by updated requirements)
    awk '!/^>/ {$0 = toupper($0)}1' seq/*.txt >output.txt
    

    Here's a variant which removes the > from the header like your original attempt did, and writes the headers to a separate file. (The header condition ends with next so if we fall through to the next line in the script, this input line is not a header.)

    awk '/^>/ { sub(/>[[:space:]]*/, ""); print >>"headers.txt"; next; }
    {$0 = toupper($0)}1' seq/*.txt >output.txt
    

    If you want all the headers followed by all the sequences in a single output file, just cat headers.txt output.txt >final.txt when this is done. Perhaps then you would like to also normalize so that every sequence is exactly one line, though. Here's a variation which does that:

    awk '/^>/ {
      if (n) printf "\n"
      n = 0
      sub(/>[[:space:]]*/, "")
      print >>"headers.txt"
      next
    }
    { printf("%s", toupper($1)); n = 1 }
    END { if (n) printf "\n" }' seq/*.txt >sequences.txt
    

    With printf we have complete control over what gets written, so we can inhibit the newline at the end of each line of sequence. We need to print one when we are past the end of the sequence; the n state variable keeps track of this, and we add the missing newline character when we see a new header, or the end of the file.

    (It would not be hard to combine everything into a single Awk script, but that would introduce a serious problem if your input files are big, because we'd need to keep all the sequences in memory until we have processed all the headers, or do two passes over all the input files. So I'll prefer to keep this simple and robust.)

    Demo: https://ideone.com/uH80nn

    As an aside, shell variables which contain file names should generally be double-quoted; see also When to wrap quotes around a shell variable?