I have a text file from which I need to extract the chunk of text that follows each header line; a chunk can span any number of lines (it's a FASTA file, for any bioinformatics people). It's basically set up like this:
> header, info, info
TEXT-------------------------------------------------------
----------------------------------------------------
>header, info...
TEXT-----------------------------------------------------
... and so forth.
I am trying to extract the "TEXT" part. Here's the code I have set up:
for line in ffile:
    if line.startswith('>'):
        # do stuff to header line
        try:
            sequence = ""
            seqcheck = ffile.next() # line after the header will always be the beginning of TEXT
            while not seqcheck.startswith('>'):
                sequence += seqcheck
                seqcheck = ffile.next()
        except: # iteration error check
            break
This doesn't work, because every call to next() advances the same iterator that the for loop is using, so I end up skipping a lot of lines and losing a lot of data. How can I just "peek" at the next line without moving the iterator forward?
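To make the failure concrete, here is a minimal sketch (Python 3 syntax, with a made-up list standing in for the file) showing that a for loop and an explicit next() call pull from the same iterator:

```python
# A made-up list stands in for the lines of the file.
lines = iter(["a\n", "b\n", "c\n", "d\n"])

seen_by_loop = []
for line in lines:
    seen_by_loop.append(line)
    # This consumes the following item, so the for loop never sees it.
    next(lines, None)

# seen_by_loop == ["a\n", "c\n"]
```

In the FASTA case, the item that gets swallowed is the next header line, which is why whole records disappear.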
I think it would be a lot easier to just check that each line doesn't start with '>':
>>> content = '''> header, info, info
... TEXT-------------------------------------------------------
... ----------------------------------------------------
... >header, info...
... TEXT-----------------------------------------------------'''
>>>
>>> from StringIO import StringIO  # Python 2; on Python 3: from io import StringIO
>>> f = StringIO(content)
>>>
>>> my_data = []
>>> for line in f:
...     if not line.startswith('>'):
...         my_data.append(line)
...
>>> ''.join(my_data)
'TEXT-------------------------------------------------------\n----------------------------------------------------\nTEXT-----------------------------------------------------'
>>>
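The filter above sidesteps lookahead entirely. If a real one-line "peek" is still wanted, as the question asks, one option is a small wrapper that caches the next item. This is only a sketch in Python 3 syntax: the `Peekable` class, its `peek` method, and the sample data are hypothetical names made up for illustration, not part of the standard library.

```python
from io import StringIO


class Peekable:
    """Hypothetical helper: wraps an iterator and allows a one-item lookahead."""

    _EMPTY = object()  # sentinel meaning "nothing cached"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = self._EMPTY

    def __iter__(self):
        return self

    def __next__(self):
        # Hand out the cached item first, if peek() stored one.
        if self._cache is not self._EMPTY:
            item, self._cache = self._cache, self._EMPTY
            return item
        return next(self._it)

    def peek(self, default=None):
        # Fetch and cache the next item without handing it out.
        if self._cache is self._EMPTY:
            try:
                self._cache = next(self._it)
            except StopIteration:
                return default
        return self._cache


# Made-up sample data in the same layout as the question.
lines = Peekable(StringIO(">h1\nAAA\nCCC\n>h2\nGGG\n"))
records = []
for line in lines:
    if line.startswith(">"):
        header = line.strip()
        sequence = ""
        while True:
            upcoming = lines.peek()
            if upcoming is None or upcoming.startswith(">"):
                break  # leave the next header for the for loop
            sequence += next(lines).strip()
        records.append((header, sequence))

# records == [(">h1", "AAACCC"), (">h2", "GGG")]
```

The third-party more_itertools package provides a similar `peekable` wrapper, if adding a dependency is acceptable.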
@tobias_k this should keep the lines of each record separate:
>>> def get_content(f):
...     my_data = []
...     for line in f:
...         if line.startswith('>'):
...             yield my_data
...             my_data = []
...         else:
...             my_data.append(line)
...     yield my_data  # the last one
...
>>>
>>> f.seek(0)
>>> for i in get_content(f):
...     print(i)
...
[]
['TEXT-------------------------------------------------------\n', '----------------------------------------------------\n']
['TEXT-----------------------------------------------------']
>>>
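Note that the generator above yields an empty list before the first record and drops the headers. A possible variant, sketched here in Python 3 syntax, pairs each header with its joined sequence and skips that empty first chunk. The name `read_fasta` and the sample data are made up for illustration.

```python
from io import StringIO


def read_fasta(lines):
    """Yield (header, sequence) tuples; a hypothetical helper for illustration."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Lines before the first header are ignored.
            chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data in the same layout as the question.
sample = StringIO(">seq1 info\nACGT\nTTGA\n>seq2 info\nGGCC\n")
records = list(read_fasta(sample))
# records == [("seq1 info", "ACGTTTGA"), ("seq2 info", "GGCC")]
```

Because it only ever looks at the current line, this needs no lookahead, and it works on an open file object as well as on a StringIO.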