Search code examples
pythonbioinformaticsbiopythondna-sequence

Counting DNA sequences with python/biopython


My script below is counting the occurrences of the sequences 'CCCCAAAA' and 'GGGGTTTT' from a standard FASTA file:

>contig00001  
CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT

The script counts the CCCCAAAA sequence here 3 times

CCCCAAAACCCCAAAACCCCAAAA(CCCC not counted)

Can somebody please advise how I would include the CCCC sequence at the end as a half count to return a value of 3.5 for this.

I've been unsuccessful in my attempts so far.

My script is as follows...

from Bio import SeqIO

input_file = open('telomer.test.fasta', 'r')
output_file = open('telomer.test1.out.tsv','w')
output_file.write('Contig\tCCCCAAAA\tGGGGTTTT\n')

for cur_record in SeqIO.parse(input_file, "fasta") :


    contig = cur_record.name
    CCCCAAAA_count = cur_record.seq.count('CCCCAAAA')
    CCCC_count = cur_record.seq.count('CCCC')

    GGGGTTTT_count = cur_record.seq.count('GGGGTTTT')
    GGGG_count = cur_record.seq.count('GGGG')
    #length = len(cur_record.seq)

    splittedContig1=contig.split(CCCCAAAA_count)

    splittedContig2=contig.split(GGGGTTTT_count)

    cnt1=len(splittedContig1)-1
    cnt2=len(splittedContig2)

  cnt1+sum([0.5 for e in splittedContig1 if e.startswith(CCCC_count)])) = CCCCAAAA_count
  cnt2+sum([0.5 for e in splittedContig2 if e.startswith(GGGG_count)])) = GGGGTTTT_count

    output_line = '%s\t%i\t%i\n' % \
    (CONTIG, CCCCAAAA_count, GGGGTTTT_count)


    output_file.write(output_line)

output_file.close()

input_file.close() 

Solution

  • You can use split and startwith list comprehension as follows:

    contig="CCCCAAAACCCCAAAACCCCAAAACCCCTAcGAaTCCCcTCATAATTGAAAGACTTAAACTTTAAAACCCTAGAAT"
    splitbase="CCCCAAAA"
    halfBase="CCCC"
    splittedContig=contig.split(splitbase)
    cnt=len(splittedContig)-1
    print cnt+sum([0.5 for e in splittedContig if e.startswith(halfBase)])
    

    Output:

    3.5
    
    1. split the strings based on CCCCAAAA. It would give the list, in the list elements CCCCAAAA will be removed
    2. length of splitted - 1 gives the number of occurrence of CCCCAAAA
    3. in the splitted element, look for elements starts with CCCC. If found add 0.5 to count for each occurence.