Search code examples
pythonbiopythonfastagff

Biopython parsing over gff features to extract CDS


Hello I'm trying to extract the coding sequences from a fasta file using a gff file with the help of biopython (https://biopython.org/wiki/GFF_Parsing)

I have tried doing what this tutorial describes but there is something I just don't seem to get right for some reason: when I iterate over the features of the sequence record only 'gff_type':'gene' is recognized.

Here is an example of what my gff file looks like: enter image description here

As you can see, my file clearly contains gff_type='CDS' entries

But when I run a simple script like this:

for rec in GFF.parse(in_handle):
    for feature in rec.features:
        if feature.type == 'CDS':
            print(feature)

No output is returned whatsoever. And when I run something like this:

for rec in GFF.parse(in_handle):
    print(rec.features)
    break

SeqFeature(FeatureLocation(ExactPosition(2951), ExactPosition(4284), strand=-1), type='gene', id='Csa1G000010'), 
SeqFeature(FeatureLocation(ExactPosition(5473), ExactPosition(14089), strand=-1), type='gene', id='Csa1G000020'), 
SeqFeature(FeatureLocation(ExactPosition(18683), ExactPosition(21806), strand=1), type='gene', id='Csa1G000030'),
...

There are 2 things I don't understand:

  • why are only type='gene' features being iterated over
  • there is a break in my script, so only 1 Sequence Record should be iterated over, if a sequence record is determined as all occurences of a single ID (for example Chr1) and a 'feature' is every gene for this ID than what is CDS? (CDS and gene are both listed in the possible gff_type values so I don't understand why one would get the preference over the other)
gff_type: {('CDS',):117359, ('exon',):120675, ('five_prime_utr',):16038, ('gene',):24274, ('mRNA',):24274, ('three_prime_utr',):15588}

I feel like the solution for me would be to iterate within each gene and extract the CDS that way but there is no SeqRecord.feature.feature attribute

There is however a SeqRecord.feature.sub_feature attribute and resulted in iteration over the mRNA in the gene (of which there is also only 1) but there is no sub_sub_feature (I checked with dir)

And so Im stuck with this problem that seems so incredibly simple. I know I could just iterate over the gff file as txt making use of the tab seperation but I am trying to get more acquainted with Biopython but to no avail (the Biopython documentation is also unfortunately quite bad). I hope someone knows how to set me on my way.

Kind Regards


Solution

  • The structure is gene -> mRNA -> exon | CDS | other sequences like 3'UTR. The exon or CDS are sub_features of mRNA, not sub_sub_features of gene.

    See Examining your GFF file here: https://biopython.org/wiki/GFF_Parsing