Search code examples
pythonstringfastadna-sequence

Count GC content of fasta using python without error


Human genome is made of 24 different chromosomes (actually 23 pairs= 46 chromosomes). this chromosomes are called 1, 2, 3, ..., 22, X and Y. Each chromosome is a very long string of 'G', 'C', 'A' and 'T' characters (for example chromosome 1 is made of almost 24 milion characters). I have each chromosome in a file (strand of chromosome 1 is in 1.fa file).

*.fa file is called fasta file which is an standard file for DNA strands information. this file has a structure like this:

>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA

As you see this kind of files contain a first line that give us some information about the source of GCAT string.

I wrote this code to count the GC content (ratio of number of G+C characters to all characters):

homo_sapiens_chromosomes_List=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, "x", "y"]
for i in homo_sapiens_chromosomes_List:
    i=str (i)
    file_open= open (i+".fa", "r") #Opening each file

    file_read= file_open.read() #Reading file

    file_read=file_read.upper() #Uppercase characters

    G=float(file_read.count("G"))   #Count G in file

    C=float(file_read.count("C"))   #Count C in file

    A=float(file_read.count("A"))   #Count A in file

    T=float(file_read.count("T"))   #Count A in file

    print "There are %d Gs, %d Cs, %d As and %d Ts, in the DNA strand, of chromosome number %s." % (G, C, A, T, i)

    print "GC content of this chromosome is:", (G+C)*100/(A+T+G+C), "percent"       #Prints GC Content

Now I have some questions:

  1. How can I make this code more efficient (faster, shorter or...)

  2. When I try to count GC content, the first line of fasta file which is not a part of DNA strand, is also counted. I wrote this function to delete this line before counting the GC content (this code comes after this line: file_read=file_read.upper()):

Code:

def Fasta_Clean(): #a function to delete the first line of fasta file
    global file_read
    if file_read.isalnum()==False:
        file_read=file_read[1:]
        Fasta_Clean()
    Fasta_Clean()

But this code returned: "RuntimeError: maximum recursion depth exceeded in cmp", so I wrote this one:

def Fasta_Clean(): #a function to delete the first line of fasta file
    global file_read
    fas=file_read[0:number]
    if fas.isalnum()==False:
        file_read=file_read[1:]
        Fasta_Clean()
    Fasta_Clean()

Now, when variable fas is more than fas=file_read[0:90], I see "RuntimeError: maximum recursion depth exceeded in cmp" again. How can I solve this problem?

  1. If the fasta file is made of more than one strand and has an structure like this (in this example file is made of three different strands):

Example:

>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
GCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAATCATTGCACTCCAGCCTGGGCAAAAACA
CACCATTGCACTCCAGCCTGGGTAGCATAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTG

>gi|568815864|ref|NC_000009.14| Homo sapiens chromosome 8, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA

>gi|568815325|ref|NC_000009.14| Homo sapiens chromosome 9, GRCh38 Primary Assembly
CTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAACATTGCACTCCAGCAAAT
GTGAGCTGAGACTGCACCATTGCTAGCCAGGCATGGTGAAGTTGCAACTCCAGCCTGGGTAGCA

In this conditions how can I count GC content of each strand separately?


Solution

  • Here this should be just about all the code you need.

    from collections import Counter
    chrome_list=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                 12, 13, 14, 15, 16, 17, 18, 19, 20,
                 21, 22, "x", "y"]
    for i in chrome_list:
        file_ = open('{}.fa'.format(i), 'r')
        broken_file = file_.read().split('\n\n')
        for line in broken_file:
            print Counter(line.split('\n')[1])
        file_.close()
    

    If you are using windows or mac you may need to change the \n\n