Background
The multi fasta format contains several record of sequences, each record begins with a single-line description, followed by several lines of sequence (RNA, DNA, protein). The description line has greaterthan symbol in the beginning, following ">" is the identifier of the sequence, and the rest of the line contains the description of record (both are optional).
in fasta files is common for lines sequences be formatted to a max width
input example, with max width="70":
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] MMTYIVFILSTIFVVSFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGY TTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEA MGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYT TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSI GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYT TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSI GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
My effort
I wrote a gawk script, it does left trim in sequences (cut first N bases) and if the length of sequence is less than trim then not print it.
gawk -v left_trim=15 -v width=70 '
BEGIN{RS=">"}
NR==1{next}; #for blank record
{
split ($0, a, "\n");
sequence="";
for(i=2; i<=length(a); i++){
sequence=sequence a[i];
};
if(length(sequence)>left_trim) {
printf(">%s\n%s\n",a[1], substr(sequence, left_trim+1));
}
}' test.fasta
result
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
expected
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV VTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV TGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV TGWTLLVGVYIVIEIARGN
Questions
how may i implement width format? i try with ">%s\n%.70s\n"
, print first 70 letters, ">%s\n%70s\n"
print all
how may i improve this join? exits a function?
sequence="";
for(i=2; i<=length(a); i++){
sequence=sequence a[i];
};
To answer your specific questions, you can specify the width of an output field using the *
format modifier:
$ awk 'BEGIN{printf "%s\n", "foo"}'
foo
$ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
foo
and no, there is no join
function to put arrays back together into a string (the opposite of split()
) BUT if you just let awk split the record into fields instead of you manually splitting the record into an array of elements, then you can just assign a new value to any field and awk will recompile the fields into $0 as I'm doing as a deliberate side-effect in my first solution below when I delete the first field/line with $1 = ""
.
Here's how I'd approach the task anyway:
$ cat tst.awk
BEGIN { RS=">"; FS="\n"; OFS="" }
NR>1 {
print RS $1
$1 = ""
for ( start=left_trim+1; start<=length(); start+=width ) {
print substr($0,start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
or if you prefer:
$ cat tst.awk
/^>/ { prtRec(); rec=""; print; next }
{ rec = rec $0 }
END { prtRec() }
function prtRec( start) {
for ( start=left_trim+1; start<=length(rec); start+=width ) {
print substr(rec,start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN