I have a TSV file from QuickDAQ with three columns of 200,000 values that I want to import into NumPy. The problem is that genfromtxt seems to miss the last line. As far as I can see, the line is nothing out of the ordinary:
...
0,00232172012329102 0,0198968648910522 0,0049593448638916
0,00411009788513184 0,0142784118652344 0,00339150428771973
0,00499653816223145 0,00666630268096924 0,00308072566986084
Example of code that doesn't quite work:
In [245]: import numpy as np
In [246]: oompa = np.genfromtxt('C_20k_73_2.tsv',delimiter='\t',usecols=(0,1,2),unpack=True,skip_header=13,dtype=str)
In [248]: oompa[1]
Out[248]:
array(['-0,00884926319122314', '-0,00379836559295654',
'0,000106096267700195', ..., '0,0259654521942139',
'0,0198968648910522', '0,0142784118652344'],
dtype='<U21')
The file has Windows-style line breaks; I've tried removing them in vi, but it makes no difference. What could cause this kind of behaviour from genfromtxt, and how can it be dealt with, preferably without manually editing the TSV file?
Well, the file seems to have some lines with just tabs. I'm surprised np.genfromtxt did not raise a ValueError. One way to prevent the problem would be to remove those empty tab lines. Another would be to pass the invalid_raise=False parameter in the call to np.genfromtxt:
oompa = np.genfromtxt('C_20k_73_2.tsv', delimiter='\t',
                      usecols=(0, 1, 2), unpack=True, skip_header=13,
                      dtype=str, invalid_raise=False)
That will skip any line whose number of columns is inconsistent with what np.genfromtxt expects to parse.
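To see the effect, here is a minimal, self-contained sketch. It uses io.StringIO in place of the real file, with made-up rows in the same decimal-comma format and a short tab line standing in for the bad lines:

```python
import io
import numpy as np

# Two good rows around a line containing only a tab, which splits into
# two empty fields instead of the expected three columns.
data = (
    "0,0023\t0,0198\t0,0049\n"
    "\t\n"
    "0,0041\t0,0142\t0,0033\n"
)

# invalid_raise=False makes genfromtxt skip (with a warning) any row
# whose column count does not match, instead of raising a ValueError.
oompa = np.genfromtxt(io.StringIO(data), delimiter='\t',
                      usecols=(0, 1, 2), unpack=True,
                      dtype=str, invalid_raise=False)
print(oompa.shape)  # (3, 2): three columns, two surviving rows
```

The bad row is dropped and the two valid rows survive, which is the behaviour you want when a stray tab-only line is hiding near the end of the file.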
If the file is not too long, an easy way to look at the last few lines of the file is

print(open(filename, 'rb').read().splitlines()[-3:])
Since this prints a list, you get the repr of the items in the list without having to call repr directly. The repr makes it easy to see where the tabs and end-of-line characters are. By comparing the repr of the last lines successfully parsed by np.genfromtxt with that of the first lines skipped, you should be able to spot the break in pattern which is causing the problem.
If the file is very long, you can print the last few lines using:
import collections

# Keep only the last two lines seen; older lines are discarded.
lines = collections.deque(maxlen=2)
with open('data', 'rb') as f:
    lines.extend(f)
print(list(lines))
The problem with open(filename, 'rb').read().splitlines() is that it reads the entire file into memory and then splits the huge string into a huge list, which can cause a MemoryError when the file is too large. The deque has a maximum number of elements, so it avoids the problem as long as the lines themselves are not too long.
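As an aside, once the bad lines are handled you don't have to keep dtype=str just because of the decimal commas. A converters mapping can turn each field into a float as it is parsed; this is a sketch with made-up rows, where to_float is a hypothetical helper that swaps the comma for a dot:

```python
import io
import numpy as np

# Made-up rows in the same decimal-comma, tab-separated format.
data = (
    "0,00232\t0,01989\t0,00495\n"
    "0,00411\t0,01427\t0,00339\n"
)

# Replace the decimal comma with a dot before float() sees the field.
to_float = lambda s: float(s.replace(',', '.'))

# encoding='utf-8' ensures the converters receive str, not bytes.
cols = np.genfromtxt(io.StringIO(data), delimiter='\t',
                     usecols=(0, 1, 2), unpack=True,
                     converters={0: to_float, 1: to_float, 2: to_float},
                     encoding='utf-8')
print(cols[0])
```

With that, cols is a regular float array and you can do arithmetic on it directly instead of post-processing strings.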