Tags: python, parquet, pyarrow

pyarrow throws ArrowNotImplementedError when creating table from numpy array


This code tries to create a pyarrow table so it can be stored as Parquet, but I get an error when converting from a numpy array. What does this error mean, and how do I fix it?

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

array = np.array(
    [
        [1.1, 1.2, 1.4],
        [1, 2, 5],
        ['a', 'b', 'c']
    ]
)

fields = [pa.field('field1',pa.float64()),
          pa.field('field2',pa.int64()),
          pa.field('field3',pa.string())]

array_table = pa.Table.from_arrays(array, schema=pa.schema(fields))

from_arrays throws:

ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName

Solution

  • A NumPy array holds a single dtype, so it can't keep ints, floats, and strings as separate types in the same array. In this case the whole array gets dtype <U32 (a little-endian Unicode string of up to 32 characters, in other words a string).

    >>> array.dtype
    dtype('<U32')
    

    So the ints and floats are converted to strings, and Arrow would have to convert the strings back to int and float respectively.

    But Arrow is unable to convert from numpy string to int and float:

    >>> pa.array(np.array([1,2,3], dtype='<U32'), pa.int32())
    ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName
    

    Instead, use one array per column in your table, each with its own type, and it should work:

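    import numpy as np
    import pyarrow as pa

    # One homogeneous NumPy array per column, so each keeps its own dtype.
    arrays = [
        np.array([1.1, 1.2, 1.4]),
        np.array([1, 2, 5]),
        np.array(['a', 'b', 'c']),
    ]

    fields = [pa.field('field1', pa.float64()),
              pa.field('field2', pa.int64()),
              pa.field('field3', pa.string())]

    # The schema types now match what each array actually contains.
    array_table = pa.Table.from_arrays(arrays, schema=pa.schema(fields))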