Tags: python, csv, for-loop, iterator, python-itertools

Iterator using itertools is skipping a line


I have the feeling that my question is related to "Why does takewhile() skip the first line?", but I haven't found a satisfactory answer there.

My examples below use the following modules:

import csv
from itertools import takewhile

Here is my problem. I have a csv file which I want to parse using itertools.

For instance, I want to separate the header from the content. The boundary between the two is marked by a keyword in the first column.

Here is an example file.csv:

a, content
b, content
KEYWORD, something else
c, let's continue

The first two lines make up the header of the file. The KEYWORD line separates the header from the content, which here is just the last line.

Even though the separation row is not strictly part of the content, I want to parse it as well.

with open('file.csv', 'rb') as f:
    reader = csv.reader(f)
    header = takewhile(lambda x: x[0] != 'KEYWORD', reader)
    for row in header:
        print(row)
    print('End of header')
    for row in reader:
        print(row)

I was not expecting this, but the KEYWORD line is skipped, as the following output shows:

['a', ' content']
['b', ' content']
End of header
['c', " let's continue"]

I tried replacing the csv reader with a plain list iterator to see whether the reader was responsible, but apparently it is not. The following code produces the same behavior:

l = [['a', 'content'],
     ['b', 'content'],
     ['KEYWORD', 'something else'],
     ['c', "let's continue"]]

i = iter(l)
header = takewhile(lambda x: x[0] != 'KEYWORD', i)
for row in header:
    print(row)
print('End of header')
for row in i:
    print(row)

How can I use takewhile while preventing the following for loop from skipping the row on which takewhile stopped?

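As I understand it, takewhile calls next on the iterator to fetch each row before testing it. The KEYWORD row has therefore already been consumed when the test fails, and takewhile discards it. The second for then calls next again and resumes with the row after the separator, so the separation row is skipped.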
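To make that explanation concrete, here is a small sketch written for Python 3. It is only an illustration: `takewhile_sketch` is a hypothetical name, and its body is a rough pure-Python approximation of what `itertools.takewhile` does, not the library's actual implementation. It shows that the row failing the test has already been pulled from the source iterator by the time the loop ends.

```python
from itertools import takewhile


def takewhile_sketch(predicate, iterable):
    """Rough pure-Python approximation of itertools.takewhile (illustration only)."""
    for item in iterable:        # each step calls next() on the source iterator
        if not predicate(item):
            return               # the failing item is already consumed, and is dropped
        yield item


rows = [['a', 'content'],
        ['b', 'content'],
        ['KEYWORD', 'something else'],
        ['c', "let's continue"]]


def is_header(row):
    return row[0] != 'KEYWORD'


# Real takewhile: the KEYWORD row is consumed and lost.
source = iter(rows)
header = list(takewhile(is_header, source))
rest = list(source)

# The sketch behaves the same way on a fresh iterator.
sketch_source = iter(rows)
sketch_header = list(takewhile_sketch(is_header, sketch_source))
sketch_rest = list(sketch_source)

print(header)  # the two header rows
print(rest)    # only the row after the separator
```

In both cases the shared iterator has moved past the KEYWORD row once the header loop finishes, which is why the second loop never sees it.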


Solution

  • Thanks to @jonrsharpe, I started thinking about a trick to work around this. Here is what I came up with:

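    class RewindableFile(file):
        """File that can step back over the last lines read through next().

        Python 2 only: it subclasses the built-in ``file`` type.
        """

        def __init__(self, *args, **kwargs):
            # Number of lines kept in history, i.e. the maximum rewind depth.
            nb_backup = kwargs.pop('nb_backup', 1)
            super(RewindableFile, self).__init__(*args, **kwargs)
            self._nb_backup = nb_backup
            self._backups = []
            # 0: read from the file; negative: replay self._backups[anchor].
            self._time_anchor = 0

        def next(self):
            if self._time_anchor >= 0:
                # Normal mode: read a fresh line and remember it.
                item = super(RewindableFile, self).next()
                self._backup(item)
                return item
            else:
                # Rewound: replay a line from the history.
                item = self._forward()
                return item

        def rewind(self):
            # Step back by one line; fail if the history is not deep enough.
            self._time_anchor = self._time_anchor - 1
            time_bound = min(self._nb_backup, len(self._backups))
            if self._time_anchor < -time_bound:
                raise Exception('You have gone too far in history...')

        def __iter__(self):
            return self

        def _backup(self, row):
            # Store the line and drop the oldest ones beyond nb_backup.
            self._backups.append(row)
            extra_items = len(self._backups) - self._nb_backup
            if extra_items > 0:
                del self._backups[0:extra_items]

        def _forward(self):
            # Return the line at the anchor and move one step toward the present.
            item = self._backups[self._time_anchor]
            self._time_anchor = self._time_anchor + 1
            return item
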

    And here is how I use it:

    with RewindableFile('csv.csv', 'rb') as f:
        def test_kwd_and_rewind(x):
            if x[0] != 'KEYWORD':
                return True
            else:
                f.rewind()
                return False
    
        reader = csv.reader(f)
        header = takewhile(test_kwd_and_rewind, reader)
        for row in header:
            print(row)
        print('End of header')
        for row in reader:
            print(row)
    

    I could also have overloaded the read and readline methods so that they support rewinding too, but I don't need them here.
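The class above depends on the built-in `file` type, which exists only in Python 2. For readers on Python 3, the same rewind idea can be sketched as a wrapper around any iterator. This is a minimal sketch under stated assumptions, not a port of the code above: `RewindableIterator`, `before_keyword` and the in-memory sample data are hypothetical names introduced for illustration, and the wrapper is placed around the csv reader (rows) rather than around the file (lines).

```python
import csv
import io
from collections import deque
from itertools import takewhile


class RewindableIterator:
    """Wrap an iterator and allow stepping back over its most recent items."""

    def __init__(self, iterable, nb_backup=1):
        self._it = iter(iterable)
        # Most recent items, oldest first; older ones fall off automatically.
        self._backups = deque(maxlen=nb_backup)
        # Number of stored items still waiting to be replayed.
        self._pending = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending:
            # Replay mode: hand back a stored item instead of reading a new one.
            item = self._backups[-self._pending]
            self._pending -= 1
            return item
        item = next(self._it)
        self._backups.append(item)
        return item

    def rewind(self):
        """Step back by one item; fail if the history is not deep enough."""
        if self._pending >= len(self._backups):
            raise ValueError('cannot rewind further than the stored history')
        self._pending += 1


# Assumption: in-memory sample data standing in for file.csv.
data = io.StringIO(
    "a, content\nb, content\nKEYWORD, something else\nc, let's continue\n"
)
rows = RewindableIterator(csv.reader(data))


def before_keyword(row):
    if row[0] != 'KEYWORD':
        return True
    rows.rewind()  # push the separator row back for the next loop
    return False


header = list(takewhile(before_keyword, rows))
content = list(rows)  # starts again at the KEYWORD row

print(header)
print(content)
```

Wrapping the reader instead of the file means a rewind always steps back by one parsed row, even if a quoted field spans several physical lines.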