Search code examples
pythonpython-2.7biopythonfasta

Using Biopython (Python) to extract sequence from FASTA file


Ok so I need to extract part of a sequence from a FASTA file, using python (biopython, http://biopython.org/DIST/docs/tutorial/Tutorial.html)

I need to get the first 10 bases from each sequence and put them in one file, preserving the sequence info from the FASTA format. Worst comes to worst, I could just use the bases if there's no way to keep the sequence info. So here's an example:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

I need some way to get the first 10 bases (and then I was planning on doing it again for the last 10 bases). That tutorial site is pretty thorough but I'm new to this and since it doesn't go into this, I'm not even sure if it's possible. Thanks for any help you can give.


Solution

  • Biopython is just perfect for these kinds of tasks. The Seq-Object stores a sequence and info about it. Reading the fasta file format is straight forward. You can access the sequence like a simple list and, hence, access certain positions straight forward as well:

    from Bio import SeqIO
    
    with open("outfile.txt","w") as f:
            for seq_record in SeqIO.parse("infile.fasta", "fasta"):
                    f.write(str(seq_record.id) + "\n")
                    f.write(str(seq_record.seq[:10]) + "\n")  #first 10 base positions
                    f.write(str(seq_record.seq[-10:]) + "\n") #last 10 base positions