Python calculate ORFs from any arbitrary reading frame

I have a big fasta file in this format:

>gi|142022655|gb|EQ086233.1|522 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
AAGACGGGCACCGTGTCCTTCGCGACGTACTCCGACCAGTTGTACACGTTCAGGTTGGTGTCGCCGGCAT
GGGCCGACAGGCTGGCCGCGACGGCCAGCGCCGCCGACGTGACGCGCGCGGCGCGCAACGCCGATTGACG
ACGGATACGGATACGCATGGGGATTCTCCTTGTGATGGGGATCGGCCGTTGCGCCCGGTCCGGGTCCGGA
CTCGCGTCAACGCCGTCGAGCGGTGTTCAGCACAAGGGCCAATGTAGAGATCGCGGCCGGCAGCGTCAGT
CCCGAAAACCGGGACAAACGGCGACGTCGATTCCCGCCGTTTGGGTAGATTCCCGCGTAGGCAGTCGAAA
ATATTCGTGATACCTGTAGCGCCACCTGAAAATCTTCGATACACGACGCCATGAGCGCTGCGCTGCCCGC
CCCCGATCTTCCGCTGAGCCACGTCGCGTTCGTGACTGAAACGCTGGGCGACATCGCACAAGCCGTCGGA
ACGCCGCAGTTCATGCGCGCCGTCTACGACACGCTCGTGCGCTACGTCGATTTCGACGCCGTGCACCTCG
ACTACGAGCGCAGCGCGTCTTCCGGCCGGCGCAGCGTCGGCTGGATCGGCAGCTTCGGCCGCGAGCCCGA
GCTGGTCGCGCAGGTGATGCGCCACTACTACCGCAGCTACGCGAGCGACGATGCAACTTACGCGGCGATC
GAAACCGAAAACGACGTGCAATTGCTGCAGGTGTCCGCGCAACGCGTGTCGAGCGAGCTACGGCATCTGT
TCTTCGATGCCGGCGACATTCATGACGAATGCGTGATCGCCGGCGTGACGGGCGGCACGCGCTACTCGAT
CTCGATCGCGCGCTCACGGCGGCTGCCGCCGTTTTCGCTGAAGGAACTGAGCCTGCTGAAGCAGCTTTCG
CAAGTCGTGCTGCCGCTGGCGTCCGCGCACAAGCGCCTGCTCGGCGCGATCTCCGCCGACGACGCACCGC
GCGACGAACTCGATCTCGACCTCGTCGCGCAATGGCTGCCGGAATGGCAGGAACGGTTGACCGCGCGCGA
GATGCATGTGTGTGCGTCGTTCATCCAGGGCATGACGTCGGCGGCCATCGCCCAATCGATGGGGCTCAAG
ACCTCCACCGTCGATACCTACGCGAAGCGCGCCTTCGCGAAGCTCGGCGTCGATTCGCGAAGGCAACTGA
TGACCCTCGTGCTGAGAAACGCGTCGCGGCGGCATGACGCATAGCATCC
>gi|142022655|gb|EQ086233.1|598 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TTGCCGCCGGCCGCAGCCGGCTTGGCACCACGCTGCGGCTGGTCGCCGGACTTCGGCTTCGCGCCGGTGT
CCGCCGGCGCTGCCGGCCGCTTCGCGTTGCGCTCCTGCTTGGCCTTCGCTGCGAGCTGCGCCCGCAATTC
GGCAAGTTGTTCAAAACCCATAAATTCAATCCACCAGGAATATAAGGTGTGGTTCGTGCGGCCATGCCGC
GCGGCGCACGAGCTTCGCCGCCATGCGTGCGACCCGTCTGCCGCCGATGCGGAATACTACGGGGCCGCAT
>gi|142022655|gb|EQ086233.1|143 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
CTGATGCGTGCGCGCGGCCGCCTGCAGCCAGCGCGTCAGTTCCGGCGCCGCCGCGCGGCTGTAGTTCAGCGCG
CCGCCGCGATCGACGGGCAGGTAATGGCCTTCGATGTCGATGCCGTCCGGCGGCGTGTTCGAGTTCGCGA
TCGAGCCGAACTTGCCGGTCTTGCGCGCCTCGACGTACGTGCCGTCGTCGACGTACTGGATCTTCAGGTC
GACGCCGAGCCGCTGCCGCGCCTGCGCCTGCAGCGCCTGCAGCAGCACGTCGCGCTGGTCGCGCACGGTC

I want to be able to find out the length of the longest open reading frame (ORF) appearing in reading frame 3 of any of the sequences?

So far, I have tried some code that lists out all the ORFs of one sequence, inputted as a string:

import re
from string import maketrans

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

def revcomp(dna_seq):
    return dna_seq[::-1].translate(maketrans("ATGC","TACG"))

def orfs(dna):
    return set(pattern.findall(dna) + pattern.findall(revcomp(dna)))

print orfs(Seq)

where Seq='''CTGATGCGTGCGCGCGGCCGCCTGCAGCCAGCGCGTCAGTTCCGGCGCCGCCGCGCGGCTGTAGTTCAGCGCGCCGCCGCGATCGACGGGCAGGTAATGGCCTTCGATGTCGATGCCGTCCGGCGGCGTGTTCGAGTTCGCGATCGAGCCGAACTTGCCGGTCTTGCGCGCCTCGACGTACGTGCCGTCGTCGACGTACTGGATCTTCAGGTCGACGCCGAGCCGCTGCCGCGCCTGCGCCTGCAGCGCCTGCAGCAGCACGTCGCGCTGGTCGCGCACGGTC''' Notice that this is the 3rd entry in the big fasta file format above.

My sample output to this is: set([]), so I am clearly doing something terribly wrong. My code doesn't even scale to multiple entries (i.e., it only takes a single DNA string, called Seq)

Can anyone point me in the right direction please?

EDIT:

N.B.: ATG is the start codon (i.e., the beginning of an ORF) and TAG, TGA, and TAA are stop codons (i.e., the end of an ORF).

Solution

EDITED 1: Completely rewritten to better match problem description.

I don't know the exact file format here, so am assuming it carries on the same way as the three sequences you show -- one sequence after another.

If I understand correctly, the reason you didn't see a match in the third sequence is that there actually isn't a match there. There are matches in the first two, though, and you will see them if you run this.

'''

import re
import string

with open('dna.txt', 'rb') as f:
    data = f.read()

data = [x.split('\n', 1) for x in data.split('>')]
data = [(x[0], ''.join(x[1].split())) for x in data if len(x) == 2]

start, end = [re.compile(x) for x in 'ATG TAG|TGA|TAA'.split()]

revtrans = string.maketrans("ATGC","TACG")

def get_longest(starts, ends):
    ''' Simple brute-force for now.  Optimize later...
        Given a list of start locations and a list
        of end locations, return the longest valid
        string.  Returns tuple (length, start position)

        Assume starts and ends are sorted correctly
        from beginning to end of string.
    '''
    results = {}
    # Use smallest end that is bigger than each start
    ends.reverse()
    for start in starts:
        for end in ends:
            if end > start and (end - start) % 3 == 0:
                results[start] = end + 3
    results = [(end - start, start) for
               start, end in results.iteritems()]
    return max(results) if results else (0, 0)

def get_orfs(dna):
    ''' Returns length, header, forward/reverse indication,
        and longest match (corrected if reversed)
    '''
    header, seqf = dna
    seqr = seqf[::-1].translate(revtrans)
    def readgroup(seq, group):
        return list(x.start() for x in group.finditer(seq))
    f = get_longest(readgroup(seqf, start), readgroup(seqf, end))
    r = get_longest(readgroup(seqr, start), readgroup(seqr, end))
    (length, index), s, direction = max((f, seqf, 'forward'), (r, seqr, 'reverse'))
    return length, header, direction, s[index:index + length]

# Process entire file
all_orfs = [get_orfs(x) for x in data]

# Put in groups of 3
all_orfs = zip(all_orfs[::3], all_orfs[1::3], all_orfs[2::3])

# Process each group of 3
for x in all_orfs:
    x = max(x)   # Only pring longest in each group
    print(x)
    print('')