Tags: python-3.x, numpy, non-ascii-characters

Parsing non-ASCII characters in Python 3


I'm trying to read from a file containing characters like é, Ä, etc. I'm using numpy.loadtxt(), but I'm getting UnicodeDecodeErrors because the decoder cannot parse them. My first priority is to preserve those characters if at all possible, but if not, I wouldn't mind resorting to replacing them. Any suggestions?


Solution

  • In addition to the link that @unutbu found (using decode/encode in genfromtxt), here's a quick sketch of a direct file reader:

    Sample file (utf8)

    é, Ä
    é, Ä
    é, Ä
    

    Readlines, split, and pass through np.array:

    In [327]: fn='uni_csv.txt'
    In [328]: with open(fn) as f:lines=f.readlines()
    In [329]: lines
    Out[329]: ['é, Ä\n', 'é, Ä\n', 'é, Ä\n']
    ...
    In [331]: [l.strip().split(',') for l in lines]
    Out[331]: [['é', ' Ä'], ['é', ' Ä'], ['é', ' Ä']]
    In [332]: np.array([l.strip().split(',') for l in lines])
    Out[332]: 
    array([['é', ' Ä'],
           ['é', ' Ä'],
           ['é', ' Ä']], 
          dtype='<U2')
    

    I don't think tab separation poses a problem (except that my text editor is set to replace tabs with spaces).
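
    If the file really were tab-delimited, the same read-and-split approach should work; a minimal sketch, assuming the same uni_csv.txt but with tab separators:

    import numpy as np

    with open('uni_csv.txt', encoding='utf-8') as f:
        # split on tabs instead of commas; strip() drops the trailing newline
        rows = [line.strip().split('\t') for line in f]
    arr = np.array(rows)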

    For mixed datatypes, I need to add a tuple conversion (a structured array definition requires a list of tuples):

    In [343]: with open(fn) as f:lines=f.readlines()
    In [344]: dt=np.dtype([('int',int),('é','|U2'),('Ä','U5')])
    In [345]: np.array([tuple(l.strip().split(',')) for l in lines], dt)
    Out[345]: 
    array([(1, ' é', ' Ä'), (2, ' é', ' Ä'), (3, ' é', ' Ä')], 
          dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
    

    (I added an integer column to my text file)
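
    The leading spaces in the string fields come straight from splitting on ','; here's a variant (just a sketch, same file and dtype as above) that strips each field before building the tuples:

    import numpy as np

    dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])
    with open('uni_csv.txt', encoding='utf-8') as f:
        # strip the whitespace around each field so 'é' is stored instead of ' é'
        rows = [tuple(field.strip() for field in line.strip().split(','))
                for line in f]
    arr = np.array(rows, dtype=dt)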


    Actually, loadtxt doesn't choke on this file with that dtype either; it just loads the strings wrong.

    In [349]: np.loadtxt('uni_csv.txt',dtype=dt, delimiter=',')
    Out[349]: 
    array([(1, "b'", "b' \\x"), (2, "b'", "b' \\x"), (3, "b'", "b' \\x")], 
          dtype=[('int', '<i4'), ('é', '<U2'), ('Ä', '<U5')])
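
    If I remember right, newer numpy (1.14+) added an encoding keyword to loadtxt and genfromtxt; assuming such a version is available, passing it should avoid those mangled byte strings:

    import numpy as np

    dt = np.dtype([('int', int), ('é', 'U2'), ('Ä', 'U5')])
    # encoding='utf-8' has loadtxt decode the file itself (numpy >= 1.14),
    # so the string columns are stored as real text instead of repr'd bytes
    arr = np.loadtxt('uni_csv.txt', dtype=dt, delimiter=',', encoding='utf-8')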