bioinformatics biopython dna-sequence

Fastest way to add Ns to variable length sequences such that they all equal 150bp


Say I have a fasta containing 3 sequences...

ATTTTTGGA
AT
A

I want my sequence data to look like this:

ATTTTTGGA
ATNNNNNNN
ANNNNNNNN

Are there any programs or scripts that could accomplish this in a reasonable timeframe? I have thousands of sequences. Thanks!

I've been experimenting and tried the script below. The output file ended up blank, but this is as far as I have gotten.

import sys
from Bio import SeqIO
from Bio.Seq import Seq
in_file = open(sys.argv[1],'r')
sequences = SeqIO.parse(in_file, "fasta")
output_in_file = open("test.fasta", "w")
for record in sequences:
    n = 150
    record.seq = record.seq + ("N" * n)
    seq = seq[:n]
output_in_file.close()
in_file.close()

Solution

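  • Your script never writes the records to the output file, which is why it ends up blank. Improving your code so that every sequence is padded to the length of the longest one: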

    import sys
    from Bio import SeqIO

    with open(sys.argv[1], "r") as in_file:
        # Load all records first so the longest one can be found
        sequences = list(SeqIO.parse(in_file, "fasta"))
        n = max(map(len, sequences))  # length of the longest sequence
        for record in sequences:
            # Append enough Ns to bring this record up to n bases
            record.seq = record.seq + ("N" * (n - len(record)))
        SeqIO.write(sequences, "test.fasta", "fasta")
    

    This writes the following to test.fasta:

    >id_1
    ATTTTTGGA
    >id_2
    ATNNNNNNN
    >id_3
    ANNNNNNNN
    

    To make every sequence exactly 150 bp instead, set n = 150:

    import sys
    from Bio import SeqIO

    with open(sys.argv[1], "r") as in_file:
        sequences = list(SeqIO.parse(in_file, "fasta"))
        n = 150  # target length in bp
        for record in sequences:
            # Sequences already longer than n are left unchanged,
            # because "N" * (negative number) is an empty string
            record.seq = record.seq + ("N" * (n - len(record)))
        SeqIO.write(sequences, "test.fasta", "fasta")
    

    which gives:

    >id_1
    ATTTTTGGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    >id_2
    ATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    >id_3
    ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
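
If you would rather not hold every record in memory, the fixed-length case can also be done one record at a time without Biopython. Below is a minimal sketch, not a drop-in replacement: the helper names read_fasta, pad_sequence and pad_fasta are made up for illustration, it assumes a plain FASTA file, and it assumes you want reads longer than 150 bp cut down to 150 (as the [:n] slice in your attempt suggests). It pads with str.ljust.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper: joins wrapped sequence lines and skips blank lines.
    Any sequence text appearing before the first '>' header is ignored.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pad_sequence(seq, n=150, fill="N"):
    """Right-pad seq with fill up to n characters.

    Assumption: sequences longer than n are truncated to n.
    """
    return seq[:n].ljust(n, fill)


def pad_fasta(in_path, out_path, n=150):
    """Stream records from in_path to out_path, padding each to n bases."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            # Each sequence is written on a single, unwrapped line
            dst.write(f">{header}\n{pad_sequence(seq, n)}\n")
```

You would call it as pad_fasta(sys.argv[1], "test.fasta"). Unlike SeqIO.write, which wraps sequence lines at 60 characters by default, this writes each padded sequence on one line.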