I'm trying to read from a file containing characters like é, Ä, etc. I'm using numpy.loadtxt() but I'm getting UnicodeDecodeErrors as the decoder cannot parse them. My first priority is to preserve those characters if at all possible but if not, I would not mind resorting to replacing them. Any suggestions?
In addition to the link that @unutbu found (using decode/encode in genfromtxt), here's a quick sketch of a direct file reader:
Sample file (utf8)
é, Ä
é, Ä
é, Ä
Readlines, split, and pass through np.array:
In [327]: fn='uni_csv.txt'
In [328]: with open(fn) as f:lines=f.readlines()
In [329]: lines
Out[329]: ['é, Ä\n', 'é, Ä\n', 'é, Ä\n']
...
In [331]: [l.strip().split(',') for l in lines]
Out[331]: [['é', ' Ä'], ['é', ' Ä'], ['é', ' Ä']]
In [332]: np.array([l.strip().split(',') for l in lines])
Out[332]:
array([['é', ' Ä'],
       ['é', ' Ä'],
       ['é', ' Ä']],
      dtype='<U2')
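For copy-paste convenience, the same steps as a plain script, with the file encoding made explicit (the file here is just the three-row sample above, recreated on the fly):

```python
import numpy as np

# recreate the sample file (utf-8)
with open('uni_csv.txt', 'w', encoding='utf-8') as f:
    f.write('é, Ä\né, Ä\né, Ä\n')

# read with an explicit encoding so the accented characters survive
with open('uni_csv.txt', encoding='utf-8') as f:
    lines = f.readlines()

# split each line on the delimiter and build a 2-d unicode array
arr = np.array([l.strip().split(',') for l in lines])
```

If the default locale encoding isn't utf-8 on your system, the explicit encoding= on open is what prevents the UnicodeDecodeError.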
I don't think tab separation poses a problem (except that my text editor is set to replace tabs with spaces).
For mixed datatypes, I need to add a tuple conversion (a structured array definition requires a list of tuples):
In [343]: with open(fn) as f:lines=f.readlines()
In [344]: dt=np.dtype([('int',int),('é','|U2'),('Ä','U5')])
In [345]: np.array([tuple(l.strip().split(',')) for l in lines], dt)
Out[345]:
array([(1, ' é', ' Ä'), (2, ' é', ' Ä'), (3, ' é', ' Ä')],
      dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
(I added an integer column to my text file)
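Written out as a script, same idea, with the encoding given explicitly (the file contents here are the three-row sample with the integer column added):

```python
import numpy as np

# sample file with the integer column added (utf-8)
with open('uni_csv.txt', 'w', encoding='utf-8') as f:
    f.write('1, é, Ä\n2, é, Ä\n3, é, Ä\n')

# structured dtype: an int field plus two unicode string fields
dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])

# each row must be a tuple, not a list, for a structured array
with open('uni_csv.txt', encoding='utf-8') as f:
    rows = [tuple(l.strip().split(',')) for l in f]

arr = np.array(rows, dtype=dt)
```

numpy casts the '1', '2', '3' strings to ints for the int field when building the array.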
Actually, loadtxt doesn't choke on this file with this dtype either; it just loads the strings wrong:
In [349]: np.loadtxt('uni_csv.txt',dtype=dt, delimiter=',')
Out[349]:
array([(1, "b'", "b' \\x"), (2, "b'", "b' \\x"), (3, "b'", "b' \\x")],
      dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
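As an aside: if your numpy is recent enough (1.14 added an encoding argument to loadtxt and genfromtxt), passing encoding='utf-8' avoids the bytes mangling entirely. A sketch, assuming such a version:

```python
import numpy as np

# same sample file as above (utf-8)
with open('uni_csv.txt', 'w', encoding='utf-8') as f:
    f.write('1, é, Ä\n2, é, Ä\n3, é, Ä\n')

dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])

# encoding= makes loadtxt decode the file itself, so the
# string fields come back as proper unicode, not byte reprs
res = np.loadtxt('uni_csv.txt', dtype=dt, delimiter=',', encoding='utf-8')
```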