Rather confused by running into this type error while trying out the Apache Parquet file format for the first time. Shouldn't Parquet support all the data types that Pandas does? What am I missing?
import pandas
import pyarrow
import numpy
data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)
raises:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)
table.pxi in pyarrow.lib.Table.from_pandas()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
354 arrays = list(executor.map(convert_column,
355 columns_to_convert,
--> 356 convert_types))
357
358 types = [x.type for x in arrays]
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
423 raise CancelledError()
424 elif self._state == FINISHED:
--> 425 return self.__get_result()
426
427 self._condition.wait(timeout)
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
54
55 try:
---> 56 result = self.fn(*self.args, **self.kwargs)
57 except BaseException as exc:
58 self.future.set_exception(exc)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
343
344 def convert_column(col, ty):
--> 345 return pa.array(col, from_pandas=True, type=ty)
346
347 if nthreads == 1:
array.pxi in pyarrow.lib.array()
array.pxi in pyarrow.lib._ndarray_to_array()
error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer
data.dtypes
is:
0 object
1 object
2 object
3 object
4 object
5 float64
6 float64
7 object
8 object
9 object
10 object
11 object
12 object
13 float64
14 object
15 float64
16 object
17 float64
...
In Apache Arrow, table columns must be homogeneous in their data types. pandas supports Python object columns where values can be of different types. So you will need to do some data scrubbing before writing to Parquet format.
We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow-Python bindings, but we don't hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.
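Until something like that lands, you can get coerce-to-null behavior yourself on the pandas side for columns that should be numeric. A sketch, assuming a hypothetical column polluted by stray strings:

```python
import pandas as pd

# Hypothetical object column that should be numeric but contains a bad value.
s = pd.Series(["1", "2", "bad", "4"], dtype=object)

# errors="coerce" turns unparseable values into NaN, which Arrow then
# writes out as null when the frame is converted to a Table.
cleaned = pd.to_numeric(s, errors="coerce")
```

The resulting column has a proper float dtype, so pyarrow.Table.from_pandas will accept it without the ArrowInvalid error.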