Search code examples
pythonbioinformaticsfasta

Python calculate ORFs from any arbitrary reading frame


I have a big fasta file in this format:

>gi|142022655|gb|EQ086233.1|522 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
AAGACGGGCACCGTGTCCTTCGCGACGTACTCCGACCAGTTGTACACGTTCAGGTTGGTGTCGCCGGCAT
GGGCCGACAGGCTGGCCGCGACGGCCAGCGCCGCCGACGTGACGCGCGCGGCGCGCAACGCCGATTGACG
ACGGATACGGATACGCATGGGGATTCTCCTTGTGATGGGGATCGGCCGTTGCGCCCGGTCCGGGTCCGGA
CTCGCGTCAACGCCGTCGAGCGGTGTTCAGCACAAGGGCCAATGTAGAGATCGCGGCCGGCAGCGTCAGT
CCCGAAAACCGGGACAAACGGCGACGTCGATTCCCGCCGTTTGGGTAGATTCCCGCGTAGGCAGTCGAAA
ATATTCGTGATACCTGTAGCGCCACCTGAAAATCTTCGATACACGACGCCATGAGCGCTGCGCTGCCCGC
CCCCGATCTTCCGCTGAGCCACGTCGCGTTCGTGACTGAAACGCTGGGCGACATCGCACAAGCCGTCGGA
ACGCCGCAGTTCATGCGCGCCGTCTACGACACGCTCGTGCGCTACGTCGATTTCGACGCCGTGCACCTCG
ACTACGAGCGCAGCGCGTCTTCCGGCCGGCGCAGCGTCGGCTGGATCGGCAGCTTCGGCCGCGAGCCCGA
GCTGGTCGCGCAGGTGATGCGCCACTACTACCGCAGCTACGCGAGCGACGATGCAACTTACGCGGCGATC
GAAACCGAAAACGACGTGCAATTGCTGCAGGTGTCCGCGCAACGCGTGTCGAGCGAGCTACGGCATCTGT
TCTTCGATGCCGGCGACATTCATGACGAATGCGTGATCGCCGGCGTGACGGGCGGCACGCGCTACTCGAT
CTCGATCGCGCGCTCACGGCGGCTGCCGCCGTTTTCGCTGAAGGAACTGAGCCTGCTGAAGCAGCTTTCG
CAAGTCGTGCTGCCGCTGGCGTCCGCGCACAAGCGCCTGCTCGGCGCGATCTCCGCCGACGACGCACCGC
GCGACGAACTCGATCTCGACCTCGTCGCGCAATGGCTGCCGGAATGGCAGGAACGGTTGACCGCGCGCGA
GATGCATGTGTGTGCGTCGTTCATCCAGGGCATGACGTCGGCGGCCATCGCCCAATCGATGGGGCTCAAG
ACCTCCACCGTCGATACCTACGCGAAGCGCGCCTTCGCGAAGCTCGGCGTCGATTCGCGAAGGCAACTGA
TGACCCTCGTGCTGAGAAACGCGTCGCGGCGGCATGACGCATAGCATCC
>gi|142022655|gb|EQ086233.1|598 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TTGCCGCCGGCCGCAGCCGGCTTGGCACCACGCTGCGGCTGGTCGCCGGACTTCGGCTTCGCGCCGGTGT
CCGCCGGCGCTGCCGGCCGCTTCGCGTTGCGCTCCTGCTTGGCCTTCGCTGCGAGCTGCGCCCGCAATTC
GGCAAGTTGTTCAAAACCCATAAATTCAATCCACCAGGAATATAAGGTGTGGTTCGTGCGGCCATGCCGC
GCGGCGCACGAGCTTCGCCGCCATGCGTGCGACCCGTCTGCCGCCGATGCGGAATACTACGGGGCCGCAT
>gi|142022655|gb|EQ086233.1|143 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
CTGATGCGTGCGCGCGGCCGCCTGCAGCCAGCGCGTCAGTTCCGGCGCCGCCGCGCGGCTGTAGTTCAGCGCG
CCGCCGCGATCGACGGGCAGGTAATGGCCTTCGATGTCGATGCCGTCCGGCGGCGTGTTCGAGTTCGCGA
TCGAGCCGAACTTGCCGGTCTTGCGCGCCTCGACGTACGTGCCGTCGTCGACGTACTGGATCTTCAGGTC
GACGCCGAGCCGCTGCCGCGCCTGCGCCTGCAGCGCCTGCAGCAGCACGTCGCGCTGGTCGCGCACGGTC

I want to be able to find out the length of the longest open reading frame (ORF) appearing in reading frame 3 of any of the sequences?

So far, I have tried some code that lists out all the ORFs of one sequence, inputted as a string:

import re
from string import maketrans

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

def revcomp(dna_seq):
    return dna_seq[::-1].translate(maketrans("ATGC","TACG"))

def orfs(dna):
    return set(pattern.findall(dna) + pattern.findall(revcomp(dna)))

print orfs(Seq)

where Seq='''CTGATGCGTGCGCGCGGCCGCCTGCAGCCAGCGCGTCAGTTCCGGCGCCGCCGCGCGGCTGTAGTTCAGCGCGCCGCCGCGATCGACGGGCAGGTAATGGCCTTCGATGTCGATGCCGTCCGGCGGCGTGTTCGAGTTCGCGATCGAGCCGAACTTGCCGGTCTTGCGCGCCTCGACGTACGTGCCGTCGTCGACGTACTGGATCTTCAGGTCGACGCCGAGCCGCTGCCGCGCCTGCGCCTGCAGCGCCTGCAGCAGCACGTCGCGCTGGTCGCGCACGGTC''' Notice that this is the 3rd entry in the big fasta file format above.

My sample output to this is: set([]), so I am clearly doing something terribly wrong. My code doesn't even scale to multiple entries (i.e., it only takes a single DNA string, called Seq)

Can anyone point me in the right direction please?

EDIT:

N.B.: ATG is the start codon (i.e., the beginning of an ORF) and TAG, TGA, and TAA are stop codons (i.e., the end of an ORF).


Solution

  • EDITED 1: Completely rewritten to better match problem description.

    I don't know the exact file format here, so am assuming it carries on the same way as the three sequences you show -- one sequence after another.

    If I understand correctly, the reason you didn't see a match in the third sequence is that there actually isn't a match there. There are matches in the first two, though, and you will see them if you run this.

    '''

    import re
    import string
    
    with open('dna.txt', 'rb') as f:
        data = f.read()
    
    data = [x.split('\n', 1) for x in data.split('>')]
    data = [(x[0], ''.join(x[1].split())) for x in data if len(x) == 2]
    
    start, end = [re.compile(x) for x in 'ATG TAG|TGA|TAA'.split()]
    
    revtrans = string.maketrans("ATGC","TACG")
    
    def get_longest(starts, ends):
        ''' Simple brute-force for now.  Optimize later...
            Given a list of start locations and a list
            of end locations, return the longest valid
            string.  Returns tuple (length, start position)
    
            Assume starts and ends are sorted correctly
            from beginning to end of string.
        '''
        results = {}
        # Use smallest end that is bigger than each start
        ends.reverse()
        for start in starts:
            for end in ends:
                if end > start and (end - start) % 3 == 0:
                    results[start] = end + 3
        results = [(end - start, start) for
                   start, end in results.iteritems()]
        return max(results) if results else (0, 0)
    
    def get_orfs(dna):
        ''' Returns length, header, forward/reverse indication,
            and longest match (corrected if reversed)
        '''
        header, seqf = dna
        seqr = seqf[::-1].translate(revtrans)
        def readgroup(seq, group):
            return list(x.start() for x in group.finditer(seq))
        f = get_longest(readgroup(seqf, start), readgroup(seqf, end))
        r = get_longest(readgroup(seqr, start), readgroup(seqr, end))
        (length, index), s, direction = max((f, seqf, 'forward'), (r, seqr, 'reverse'))
        return length, header, direction, s[index:index + length]
    
    # Process entire file
    all_orfs = [get_orfs(x) for x in data]
    
    # Put in groups of 3
    all_orfs = zip(all_orfs[::3], all_orfs[1::3], all_orfs[2::3])
    
    # Process each group of 3
    for x in all_orfs:
        x = max(x)   # Only pring longest in each group
        print(x)
        print('')