python, bioinformatics, biopython, gff

Bcbio-gff File creation issue


When I create a file using GFF.write(), I get an extra line with "annotation" as the source and "remark" as the type, followed by a percent-encoded (URL-encoded) list of sequence regions:

##gff-version 3
##sequence-region NC_011594.1 1 16779
NC_011594.1 annotation  remark  1   16779   .   .   .   gff-version=3;sequence-region=%28%27NC_011594.1%27%2C 0%2C 16971%29,%28%27NC_042493.1%27%2C 0%2C 132544852%29, (continues on and on)
NC_011594.1 RefSeq  gene    1   1531    .   +   .   Dbxref=GeneID:7055888;ID=gene-COX1;Name=COX1;gbkey=Gene;gene=COX1;gene_biotype=protein_coding

Any idea why it's there, what it's for, and how I could avoid it? I fear it might become a problem when the file is used with third-party software.

I imported only the bcbio-gff package, but I believe it's part of Biopython, link: https://biopython.org/wiki/GFF_Parsing


Solution

  • To your first question, "Why is it there?"

    • I can only presume that the package author wanted to export as much information as possible by default. The extra line appears to hold the record's annotations, written out as a GFF attribute string.

    To your next question, "How can I avoid it?"

    • Unfortunately, there is no off switch. The solution that worked for me was to remove all annotations from the sequences being exported, i.e. set each record's annotations attribute to an empty dictionary before calling GFF.write().

    Example:

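    from Bio import SeqIO
    from BCBio import GFF

    # Read a single GenBank record
    g = SeqIO.read('NC_003888.3.gb', 'gb')

    # Drop the record-level annotations so no "annotation remark" line is written
    g.annotations = {}

    # GFF.write() expects an iterable of records
    with open('t2.gff', 'w') as f:
        GFF.write([g], f)
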

    Head of the output file, with no "annotation remark" line:

    head t2.gff 
    ##gff-version 3
    ##sequence-region NC_003888.3 1 8667507
    NC_003888.3 feature source  1   8667507 ... removed for clarity ....
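
If the GFF file has already been written and re-exporting is not practical, another option is to filter the unwanted lines out afterwards. The following is only a minimal sketch using the standard library, not part of bcbio-gff: the helper name strip_annotation_remarks is made up for illustration, and it assumes the file is tab-separated with "annotation" in the source column and "remark" in the type column, as in the output shown in the question.

```python
import io


def strip_annotation_remarks(in_handle, out_handle):
    """Copy GFF lines, skipping feature lines whose source is
    "annotation" and whose type is "remark".

    Hypothetical helper for illustration; returns the number of
    lines that were dropped.
    """
    removed = 0
    for line in in_handle:
        # Directive and comment lines (starting with '#') are copied as-is
        if not line.startswith("#"):
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 3 and cols[1] == "annotation" and cols[2] == "remark":
                removed += 1
                continue
        out_handle.write(line)
    return removed


# Small in-memory demonstration with made-up data
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tannotation\tremark\t1\t10\t.\t.\t.\tgff-version=3\n"
    "chr1\tRefSeq\tgene\t1\t5\t.\t+\t.\tID=g1\n"
)
cleaned = io.StringIO()
dropped = strip_annotation_remarks(example, cleaned)
print(dropped)
print(cleaned.getvalue(), end="")
```

With real files, the same function can be called on two handles opened with open(), one for reading and one for writing. Clearing the annotations before GFF.write(), as in the answer above, remains the simpler route when you control the export step.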