This code attempts to create a pyarrow table so that I can store it in Parquet, but I get an error when converting from a NumPy array. What is this error, and how can I fix it?
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
array = np.array(
[
[1.1, 1.2, 1.4],
[1, 2, 5],
['a', 'b', 'c']
]
)
fields = [pa.field('field1',pa.float64()),
pa.field('field2',pa.int64()),
pa.field('field3',pa.string())]
array_table = pa.Table.from_arrays(array, schema=pa.schema(fields))
from_arrays throws:
ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName
A NumPy array can't hold heterogeneous types (int, float and string in the same array). So in this case the array ends up with dtype <U32 (a little-endian Unicode string of up to 32 characters, in other words a string).
>>> array.dtype
dtype('<U32')
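To see what that coercion does to the individual values, here is a minimal sketch that only needs NumPy. The variable name `mixed` is mine, not from the question; it holds the same data as `array` above.

```python
import numpy as np

# Same nested list as in the question: floats, ints and strings mixed.
mixed = np.array([
    [1.1, 1.2, 1.4],
    [1, 2, 5],
    ['a', 'b', 'c'],
])

# 'U' is NumPy's kind code for Unicode strings.
print(mixed.dtype.kind)

# The row that was written as ints now holds strings, not numbers.
print(mixed[1].tolist())
```

Every row has the same string dtype, so by the time Arrow sees the array the original int and float types are already gone.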
So the ints and floats get converted to strings, and Arrow would have to convert the strings back to int and float respectively.
But Arrow is unable to convert from a NumPy string array to int or float:
>>> pa.array(np.array([1,2,3], dtype='<U32'), pa.int32())
NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName
Instead, you should have one array for each column in your table, each with its own type, and it should work:
arrays = [
np.array([1.1, 1.2, 1.4]),
np.array([1, 2, 5]),
np.array(['a', 'b', 'c'])
]
fields = [pa.field('field1',pa.float64()),
pa.field('field2',pa.int64()),
pa.field('field3',pa.string())]
array_table = pa.Table.from_arrays(arrays, schema=pa.schema(fields))