I am trying to read a single chromosome sequence from a genome file in Python. The genome file has the following format, but with more lines of sequence for each chromosome:
Chr1
ATCGTGTGATGGTGCGTAGATGCTGAT
GCTGATGTGTCGAGCGATGCTGAGTCG
Chr2
TGCGTGATGCTGAGCGATGCTGATGCT
TAGCTGACCACACACCTGTTTTGTAGG
Chr3
CAGTCGTAGCGATGCTGATGATGCTGA
GGTTGGTTGGCGGACCACCATTACTAT
I use the following code to read the whole genome sequence. However, I only want the sequence of one chromosome (e.g. the whole sequence of Chr2). Rather than reading the whole genome and then searching for Chr2, is there another way to do this?
Thank you
genome = []
with open("genome.txt") as f:
    for line in f:
        genome.append(line.rstrip())
Open the file and read it line by line until you find the 'Chr2' header.
Then consume all non-empty lines until you reach EOF or a line beginning with 'Chr'.
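def getgenomes(gfile):
    # Collect sequence lines until the next 'Chr' header or EOF.
    g = []
    for line in gfile:
        if line.startswith('Chr'):
            break
        if (line := line.strip()):  # skip blank lines (needs Python 3.8+)
            g.append(line)
    return g

with open('genome.txt', encoding='utf-8') as gfile:
    genomes = None
    for line in gfile:
        # Compare the whole header so that e.g. 'Chr20' does not match 'Chr2'.
        if line.strip() == 'Chr2':
            genomes = getgenomes(gfile)
            break
    print(genomes)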
output:
['TGCGTGATGCTGAGCGATGCTGATGCT', 'TAGCTGACCACACACCTGTTTTGTAGG']
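If you would rather have the sequence as one string, or need to look up other chromosomes too, the same approach can be wrapped in a single function. This is only a minimal sketch: `read_chromosome`, `name` and `header_prefix` are names I made up for illustration, and it assumes each header line contains nothing but the chromosome name and that every header begins with 'Chr', as in the sample above.

```python
def read_chromosome(path, name, header_prefix="Chr"):
    """Return the sequence of chromosome `name` as one string, or None if absent.

    Hypothetical helper; assumes header lines hold only the chromosome name
    and that all headers start with `header_prefix`.
    """
    parts = []
    found = False
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not found:
                # Still searching: match the requested header exactly.
                found = (line == name)
            elif line.startswith(header_prefix):
                # Reached the next chromosome header, so stop reading.
                break
            elif line:
                # Non-empty sequence line belonging to the requested chromosome.
                parts.append(line)
    return "".join(parts) if found else None
```

Calling `read_chromosome("genome.txt", "Chr2")` on the sample file should give the two Chr2 lines joined into one string. The file is still scanned from the top, but reading stops at the first header after the requested chromosome, and only that chromosome's lines are kept in memory.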