I have a TSV file from QuickDAQ with three columns of 200,000 values that I want to import into NumPy. The problem is that genfromtxt seems to miss the last line. As far as I can see, the line is nothing out of the ordinary:
...
0,00232172012329102 0,0198968648910522 0,0049593448638916
0,00411009788513184 0,0142784118652344 0,00339150428771973
0,00499653816223145 0,00666630268096924 0,00308072566986084
Example of code that doesn't quite work:
In [245]: import numpy as np
In [246]: oompa = np.genfromtxt('C_20k_73_2.tsv',delimiter='\t',usecols=(0,1,2),unpack=True,skip_header=13,dtype=str)
In [248]: oompa[1]
Out[248]:
array(['-0,00884926319122314', '-0,00379836559295654',
'0,000106096267700195', ..., '0,0259654521942139',
'0,0198968648910522', '0,0142784118652344'],
dtype='<U21')
The file has Windows-style line breaks; I've tried removing them in vi, but it makes no difference. What could cause this kind of behaviour from genfromtxt, and how can it be dealt with, preferably without manually editing the TSV file?
Well, the file seems to have some lines with just tabs. I'm surprised np.genfromtxt did not raise a ValueError. One way to prevent the problem would be to remove those empty tab lines. Another would be to pass the invalid_raise=False parameter in the call to np.genfromtxt:
oompa = np.genfromtxt('C_20k_73_2.tsv', delimiter='\t',
                      usecols=(0, 1, 2), unpack=True, skip_header=13,
                      dtype=str, invalid_raise=False)
That will skip any line whose number of columns is inconsistent with what np.genfromtxt expects to parse.
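To see the effect, here is a minimal, self-contained sketch. It uses io.StringIO in place of the real file, with made-up rows in the same decimal-comma format and a short tab line standing in for the bad lines:

```python
import io
import numpy as np

# Two good rows around a line containing only a tab, which splits into
# two empty fields instead of the expected three columns.
data = (
    "0,0023\t0,0198\t0,0049\n"
    "\t\n"
    "0,0041\t0,0142\t0,0033\n"
)

# invalid_raise=False makes genfromtxt skip (with a warning) any row
# whose column count does not match, instead of raising a ValueError.
oompa = np.genfromtxt(io.StringIO(data), delimiter='\t',
                      usecols=(0, 1, 2), unpack=True,
                      dtype=str, invalid_raise=False)
print(oompa.shape)  # (3, 2): three columns, two surviving rows
```

The bad row is dropped and the two valid rows survive, which is the behaviour you want when a stray tab-only line is hiding near the end of the file.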
If the file is not too long, an easy way to look at the last few lines of the file is

print(open(filename, 'rb').read().splitlines()[-3:])
Since this prints a list, you get the repr of the items in the list without having to call repr directly. The repr makes it easy to see where the tabs and end-of-line characters are. By comparing the repr of the last lines successfully parsed by np.genfromtxt with that of the first lines skipped, you should be able to spot the break in pattern which is causing the problem.
If the file is very long, you can print the last few lines using:
import collections

# Keep only the last two lines seen; older lines are discarded.
lines = collections.deque(maxlen=2)
with open('data', 'rb') as f:
    lines.extend(f)
print(list(lines))
The problem with open(filename, 'rb').read().splitlines() is that it reads the entire file into memory and then splits the huge string into a huge list, which can cause a MemoryError when the file is too large. The deque has a maximum number of elements, so it avoids the problem as long as the lines themselves are not too long.
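As an aside, once the bad lines are handled you don't have to keep dtype=str just because of the decimal commas. A converters mapping can turn each field into a float as it is parsed; this is a sketch with made-up rows, where to_float is a hypothetical helper that swaps the comma for a dot:

```python
import io
import numpy as np

# Made-up rows in the same decimal-comma, tab-separated format.
data = (
    "0,00232\t0,01989\t0,00495\n"
    "0,00411\t0,01427\t0,00339\n"
)

# Replace the decimal comma with a dot before float() sees the field.
to_float = lambda s: float(s.replace(',', '.'))

# encoding='utf-8' ensures the converters receive str, not bytes.
cols = np.genfromtxt(io.StringIO(data), delimiter='\t',
                     usecols=(0, 1, 2), unpack=True,
                     converters={0: to_float, 1: to_float, 2: to_float},
                     encoding='utf-8')
print(cols[0])
```

With that, cols is a regular float array and you can do arithmetic on it directly instead of post-processing strings.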