I have the feeling that my question is related to "Why does takewhile() skip the first line?", but I haven't found a satisfactory answer there.
My examples below use the following modules:
import csv
from itertools import takewhile
Here is my problem. I have a csv file which I want to parse using itertools.
For instance, I want to separate the header from the content. The boundary between them is marked by a keyword in the first column.
Here is an example file.csv:
a, content
b, content
KEYWORD, something else
c, let's continue
The first two lines make up the header of the file.
The KEYWORD line separates it from the content, which is the last line.
Even though the separation row is not strictly part of the content, I want to parse it as well.
with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    header = takewhile(lambda x: x[0] != 'KEYWORD', reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)
I was not expecting this, but the KEYWORD line is skipped, as you can see in the following output:
['a', ' content']
['b', ' content']
End of header
['c', " let's continue"]
I have tried simulating the csv reader to see if the problem was coming from there, but apparently it is not. The following code produces the same behavior.
l = [['a', 'content'],
     ['b', 'content'],
     ['KEYWORD', 'something else'],
     ['c', "let's continue"]]
i = iter(l)
header = takewhile(lambda x: x[0] != 'KEYWORD', i)
for row in header:
    print(row)
print('End of header')
for row in i:
    print(row)
How can I use takewhile while preventing the following for loop from skipping the row that takewhile did not return?
As I understand it, the first for calls next on the iterator to test the row's content, which consumes that row.
The second for calls next once again to get a value, so it receives the row after it.
The separation row is hence skipped.
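To check this understanding, here is a minimal sketch (written for Python 3) that records every row actually pulled from the underlying iterator. The names traced and pulled are hypothetical helpers introduced only for this illustration.

```python
from itertools import takewhile

pulled = []  # every row actually taken from the underlying iterator


def traced(iterable):
    """Yield items unchanged, recording each one as it is consumed."""
    for item in iterable:
        pulled.append(item)
        yield item


rows = traced([['a', 'content'],
               ['b', 'content'],
               ['KEYWORD', 'something else'],
               ['c', "let's continue"]])

header = list(takewhile(lambda x: x[0] != 'KEYWORD', rows))
pulled_by_header = list(pulled)  # snapshot taken before reading the rest
remaining = list(rows)

print(header)            # the two header rows
print(pulled_by_header)  # includes the KEYWORD row: takewhile had to fetch it to test it
print(remaining)         # only the row after KEYWORD is left
```

The KEYWORD row shows up in pulled_by_header but in neither header nor remaining: takewhile fetched it, saw the predicate fail, and dropped it.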
Thanks to @jonrsharpe, I started thinking about a trick to work around this. Here is what I came up with:
class RewindableFile(file):
    def __init__(self, *args, **kwargs):
        # nb_backup: how many of the most recent lines are kept for rewinding
        nb_backup = kwargs.pop('nb_backup', 1)
        super(RewindableFile, self).__init__(*args, **kwargs)
        self._nb_backup = nb_backup
        self._backups = []
        # 0: read from the file; -n: n backed-up lines are waiting to be replayed
        self._time_anchor = 0

    def next(self):
        if self._time_anchor >= 0:
            item = super(RewindableFile, self).next()
            self._backup(item)
            return item
        else:
            item = self._forward()
            return item

    def rewind(self):
        # Step back by one line; fail if there is no backup that far back
        self._time_anchor = self._time_anchor - 1
        time_bound = min(self._nb_backup, len(self._backups))
        if self._time_anchor < -time_bound:
            raise Exception('You have gone too far in history...')

    def __iter__(self):
        return self

    def _backup(self, row):
        # Keep only the nb_backup most recent lines
        self._backups.append(row)
        extra_items = len(self._backups) - self._nb_backup
        if extra_items > 0:
            del self._backups[0:extra_items]

    def _forward(self):
        # Replay a backed-up line and move the anchor back towards the present
        item = self._backups[self._time_anchor]
        self._time_anchor = self._time_anchor + 1
        return item
And here is how I use it:
with RewindableFile('csv.csv', 'rb') as f:
    def test_kwd_and_rewind(x):
        if x[0] != 'KEYWORD':
            return True
        else:
            f.rewind()
            return False
    reader = csv.reader(f)
    header = takewhile(test_kwd_and_rewind, reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)
I could also have overloaded the read and readline functions to save the jump, but I don't need them here.
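The class above subclasses file, which only exists in Python 2. As a rough sketch of the same idea for Python 3, the wrapper below keeps a small backup of recent items for any iterator, so it can wrap the csv reader itself rather than the file. RewindableIterator and is_header are hypothetical names, and the file content is simulated with io.StringIO; this is one possible adaptation, not a definitive implementation.

```python
import csv
import io
from itertools import takewhile


class RewindableIterator:
    """Wrap an iterator and allow stepping back over recently returned items."""

    def __init__(self, iterable, nb_backup=1):
        self._it = iter(iterable)
        self._nb_backup = nb_backup
        self._backups = []   # most recent items, oldest first
        self._pending = 0    # number of backed-up items waiting to be replayed

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending:
            # Replay a backed-up item instead of reading a new one
            item = self._backups[-self._pending]
            self._pending -= 1
            return item
        item = next(self._it)
        self._backups.append(item)
        extra_items = len(self._backups) - self._nb_backup
        if extra_items > 0:
            del self._backups[:extra_items]  # keep only the newest items
        return item

    def rewind(self):
        if self._pending >= len(self._backups):
            raise ValueError('You have gone too far in history...')
        self._pending += 1


# Simulated file.csv content (assumption: same rows as the example above)
data = io.StringIO("a, content\nb, content\n"
                   "KEYWORD, something else\nc, let's continue\n")
rows = RewindableIterator(csv.reader(data))


def is_header(row):
    if row[0] != 'KEYWORD':
        return True
    rows.rewind()  # hand the separation row back before takewhile drops it
    return False


header = list(takewhile(is_header, rows))
rest = list(rows)
print(header)
print('End of header')
print(rest)
```

Wrapping the reader instead of the file means the backup holds parsed rows rather than raw lines, so a rewind always corresponds to exactly one row, even if a quoted field spans several lines.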