python bioinformatics dna-sequence transcription

Parsing items in a list of lists in groups of three and extracting the fragment inside a reading frame (AKA DNA exon transcription)

I am trying to read the items in a list of lists three at a time, looking for one combination of three items (the start codon) that marks the beginning of a fragment and another combination of three items (the stop codon) that marks its end.

Thus, the reading frame and the list should be read by the program like this:

list 1: XXXXX-start-fragment of interest-stop-XXXXXXX

What I'm trying to do is extract just the fragment of interest, append it to another list, and discard the rest.

This is a more concrete example:

Start codon: ATG

Stop codon: TAG

gene_1= 'ACGGACTATTC'

gene_2= 'GGCCATGAGTAACGCATAGGGCCC'

gene_3= 'GGGCCCATGACGTACTAGGGGCCCATGCATTCATAG'

So, the first list does not contain any fragment of interest, whereas the second contains one and the third contains two. I'm trying to discard everything outside these reading frames and append the fragments of interest to a list that should look something like this:

frag_int = ['AGTAACGCA', 'ACGTAC', 'CATTCA']
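One way to arrive at exactly this list can be sketched as follows. It assumes each gene is available as a plain string and that a fragment runs from just after an ATG to the next TAG in the same frame as that ATG. The names `extract_fragments`, `START`, and `STOP` are illustrative and not part of the original post.

```python
START, STOP = "ATG", "TAG"

def extract_fragments(seq):
    """Return the stretches between each ATG and the next TAG in its frame."""
    fragments = []
    pos = seq.find(START)
    while pos != -1:
        body_start = pos + 3
        end = None
        # Walk codon by codon in the frame set by this ATG
        for i in range(body_start, len(seq) - 2, 3):
            if seq[i:i + 3] == STOP:
                end = i
                break
        if end is None:
            # No in-frame stop for this ATG: try the next ATG instead
            pos = seq.find(START, pos + 1)
            continue
        fragments.append(seq[body_start:end])
        # Resume the search after the stop codon
        pos = seq.find(START, end + 3)
    return fragments

genes = [
    "ACGGACTATTC",
    "GGCCATGAGTAACGCATAGGGCCC",
    "GGGCCCATGACGTACTAGGGGCCCATGCATTCATAG",
]

frag_int = []
for gene in genes:
    frag_int.extend(extract_fragments(gene))
print(frag_int)  # ['AGTAACGCA', 'ACGTAC', 'CATTCA']
```

Searching with `str.find` rather than stepping from index 0 matters here because the ATG in gene_2 starts at index 4, which is not a multiple of three.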

This is what I have so far:

#These are str
genelist=[]

gene_1= 'A','C','G','G','A','C','T','A','T','T','C'
gene_2= 'G','G','C','C','A','T','G','A','G','T','A','A','C','G','C','A','T','A','G','G','G','C','C','C'
gene_3='G','G','G','C','C','C','A','T','G','A','C','G','T','A','C','T','A','G','G','G','G','C','C','C','A','T','G','C','A','T','T','C','A','T','A','G'

genelist.append(gene_1)
genelist.append(gene_2)
genelist.append(gene_3)

def transcription(ORF):
    mRNA= ''
    for i in range(0, len(ORF), 3):
        codon= ORF[i:i+3]
        if codon != 'ATG':
            next(codon)
            if codon == 'ATG':
                mRNA=codon.transcribe()
                if codon == 'TAG':
                    break
    return(mRNA)

mRNAs=[]
for gene in genelist:
    for codon in gene:
        mRNA= transcription(codon)
        mRNAs.append(mRNA)
print(mRNAs)

But it does not really return anything. I wonder whether the code is too redundant and whether I need to define a function here at all. Do you know a better way to do this? Thanks!
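A few quick checks, which are not part of the original post, may help show why the attempt cannot match a codon. They rest on three observations: slicing a tuple of bases gives a tuple rather than a string, looping over a gene yields single bases rather than codons, and `next()` only accepts an iterator. The variable names are illustrative.

```python
gene_2 = ('G', 'G', 'C', 'C', 'A', 'T', 'G', 'A', 'G', 'T')

# Slicing a tuple gives a tuple, which never equals the string 'ATG'
chunk = gene_2[4:7]
tuple_matches = (chunk == 'ATG')
joined_matches = ("".join(chunk) == 'ATG')
print(chunk, tuple_matches, joined_matches)

# Iterating over a gene yields single bases, not codons
first_items = [item for item in gene_2][:3]
print(first_items)

# next() needs an iterator; a plain str is iterable but is not an iterator
try:
    next('ATG')
    next_failed = False
except TypeError:
    next_failed = True
print(next_failed)
```

Joining each three-item slice with `"".join(...)` and passing the whole gene, rather than each base, to the function addresses the first two points.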

Solution

Thanks, everyone, for your comments. I went to the bioinformatics section and got help from @terdon. This is the most basic way of doing what I described in the problem. Note, however, that anyone trying to find ORFs and transcribe genes in Python has some biological rules to take into account, such as the reading frames and the stop codons; this is just an example of how to start building the code. Also note that this code uses Biopython.

from Bio.Seq import Seq
from Bio.Seq import transcribe

genelist=[]

gene_1= 'A','C','G','G','A','C','T','A','T','T','C'
gene_2= 'G','G','C','C','A','T','G','A','G','T','A','A','C','G','C','A','T','A','G','G','G','C','C','C'
gene_3='G','G','G','C','C','C','A','T','G','A','C','G','T','A','C','T','A','G','G','G','G','C','C','C','A','T','G','C','A','T','T','C','A','T','A','G'

genelist.append(gene_1)
genelist.append(gene_2)
genelist.append(gene_3)

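def transcription(ORF):
    # Transcribe from the first ATG through the next TAG, both included;
    # codons are read from index 0 only (a single reading frame)
    mRNA= ''
    foundStart = False
    for i in range(0, len(ORF), 3):
        # Join the single-base items back into a three-letter codon
        codon= "".join(ORF[i:i+3])
        if codon == 'ATG':
            foundStart = True
        if foundStart:
            mRNA = mRNA + transcribe(codon)
            # Only a TAG that comes after the start codon ends the fragment
            if codon == 'TAG':
                break
    return(mRNA)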

mRNAs=[]
for gene in genelist:
    mRNA = transcription(gene)
    mRNAs.append(mRNA)
print(mRNAs)
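The function above reads codons from index 0 only, so an ATG that begins at another offset (as in gene_2, where it begins at index 4) is never seen. The following is a rough sketch of one way to cover the three forward reading frames, not a complete ORF finder: it ignores the reverse strand and stop codons other than TAG. It uses `str.replace` in place of Biopython's `transcribe`, and the helper names `codons` and `transcribe_orfs` are hypothetical.

```python
def codons(seq, frame):
    """Yield complete three-letter codons of seq, starting at offset frame."""
    for i in range(frame, len(seq) - 2, 3):
        yield seq[i:i + 3]

def transcribe_orfs(gene, start="ATG", stop="TAG"):
    """Return every start...stop stretch (codons included) as RNA, over frames 0-2."""
    seq = "".join(gene)  # accepts a str or a tuple/list of single bases
    found = []
    for frame in range(3):
        current = None  # codons collected since the last start codon
        for codon in codons(seq, frame):
            if current is None:
                if codon == start:
                    current = [codon]
            else:
                current.append(codon)
                if codon == stop:
                    # DNA coding strand to mRNA: replace T with U
                    found.append("".join(current).replace("T", "U"))
                    current = None
    return found

gene_2 = tuple("GGCCATGAGTAACGCATAGGGCCC")
gene_3 = tuple("GGGCCCATGACGTACTAGGGGCCCATGCATTCATAG")
print(transcribe_orfs(gene_2))  # ['AUGAGUAACGCAUAG']
print(transcribe_orfs(gene_3))  # ['AUGACGUACUAG', 'AUGCAUUCAUAG']
```

Unlike the single-frame version, this sketch also keeps looking after a stop codon, so both fragments in gene_3 are returned.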