Human genome is made of 24 different chromosomes (actually 23 pairs= 46 chromosomes). this chromosomes are called 1
, 2
, 3
, ..., 22
, X
and Y
. Each chromosome is a very long string of 'G'
, 'C'
, 'A'
and 'T'
characters (for example chromosome 1 is made of almost 24 milion characters). I have each chromosome in a file (strand of chromosome 1 is in 1.fa
file).
*.fa
file is called fasta file which is an standard file for DNA strands information. this file has a structure like this:
>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA
As you see this kind of files contain a first line that give us some information about the source of GCAT string.
I wrote this code to count the GC content (ratio of number of G+C characters to all characters):
homo_sapiens_chromosomes_List=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, "x", "y"]
for i in homo_sapiens_chromosomes_List:
i=str (i)
file_open= open (i+".fa", "r") #Opening each file
file_read= file_open.read() #Reading file
file_read=file_read.upper() #Uppercase characters
G=float(file_read.count("G")) #Count G in file
C=float(file_read.count("C")) #Count C in file
A=float(file_read.count("A")) #Count A in file
T=float(file_read.count("T")) #Count A in file
print "There are %d Gs, %d Cs, %d As and %d Ts, in the DNA strand, of chromosome number %s." % (G, C, A, T, i)
print "GC content of this chromosome is:", (G+C)*100/(A+T+G+C), "percent" #Prints GC Content
Now I have some questions:
How can I make this code more efficient (faster, shorter or...)
When I try to count GC content, the first line of fasta file which is not a part of DNA strand, is also counted. I wrote this function to delete this line before counting the GC content (this code comes after this line: file_read=file_read.upper()
):
Code:
def Fasta_Clean(): #a function to delete the first line of fasta file
global file_read
if file_read.isalnum()==False:
file_read=file_read[1:]
Fasta_Clean()
Fasta_Clean()
But this code returned: "RuntimeError: maximum recursion depth exceeded in cmp"
, so I wrote this one:
def Fasta_Clean(): #a function to delete the first line of fasta file
global file_read
fas=file_read[0:number]
if fas.isalnum()==False:
file_read=file_read[1:]
Fasta_Clean()
Fasta_Clean()
Now, when variable fas
is more than fas=file_read[0:90]
, I see "RuntimeError: maximum recursion depth exceeded in cmp"
again. How can I solve this problem?
Example:
>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
GCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAATCATTGCACTCCAGCCTGGGCAAAAACA
CACCATTGCACTCCAGCCTGGGTAGCATAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTG
>gi|568815864|ref|NC_000009.14| Homo sapiens chromosome 8, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA
>gi|568815325|ref|NC_000009.14| Homo sapiens chromosome 9, GRCh38 Primary Assembly
CTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAACATTGCACTCCAGCAAAT
GTGAGCTGAGACTGCACCATTGCTAGCCAGGCATGGTGAAGTTGCAACTCCAGCCTGGGTAGCA
In this conditions how can I count GC content of each strand separately?
Here this should be just about all the code you need.
from collections import Counter
chrome_list=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, "x", "y"]
for i in chrome_list:
file_ = open('{}.fa'.format(i), 'r')
broken_file = file_.read().split('\n\n')
for line in broken_file:
print Counter(line.split('\n')[1])
file_.close()
If you are using windows or mac you may need to change the \n\n