Tags: python, pandas, csv, data-science, parquet

Type error on first steps with Apache Parquet


Rather confused by running into this type error while trying out the Apache Parquet file format for the first time. Shouldn't Parquet support all the data types that Pandas does? What am I missing?

import pandas
import pyarrow
import numpy

data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)

raises:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)

table.pxi in pyarrow.lib.Table.from_pandas()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
    354             arrays = list(executor.map(convert_column,
    355                                        columns_to_convert,
--> 356                                        convert_types))
    357 
    358     types = [x.type for x in arrays]

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    423                 raise CancelledError()
    424             elif self._state == FINISHED:
--> 425                 return self.__get_result()
    426 
    427             self._condition.wait(timeout)

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
    343 
    344     def convert_column(col, ty):
--> 345         return pa.array(col, from_pandas=True, type=ty)
    346 
    347     if nthreads == 1:

array.pxi in pyarrow.lib.array()

array.pxi in pyarrow.lib._ndarray_to_array()

error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer

data.dtypes is:

0      object
1      object
2      object
3      object
4      object
5     float64
6     float64
7      object
8      object
9      object
10     object
11     object
12     object
13    float64
14     object
15    float64
16     object
17    float64
...

Solution

  • In Apache Arrow, table columns must be homogeneous in their data types, whereas pandas supports Python `object` columns whose values can be of different types. The error means Arrow inferred `Int64` for one of your `object` columns (presumably from its leading values) and then hit a string further down. So you will need to do some data scrubbing before writing to Parquet format.

    We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow-Python bindings but we don't hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.
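    A minimal sketch of that scrubbing (with a made-up DataFrame standing in for your CSV): find `object` columns that mix Python types, then cast them to a single type such as `str` before handing the frame to Arrow.

    ```python
    import pandas as pd
    import pyarrow as pa

    # Hypothetical data: "id" mixes int and str, so pandas stores it as object.
    df = pd.DataFrame({
        "id": [1, 2, "3"],
        "price": [1.5, 2.0, 3.25],
    })

    # Diagnose: list the distinct Python types in each object column.
    for col in df.select_dtypes(include="object"):
        types = df[col].map(type).unique()
        if len(types) > 1:
            print(col, types)  # flags "id" as mixed int/str

    # Scrub: coerce the mixed column to one type so Arrow sees it as homogeneous.
    df["id"] = df["id"].astype(str)

    table = pa.Table.from_pandas(df)  # now succeeds; "id" becomes a string column
    print(table.schema)
    ```

    Casting to `str` is the blunt option; if the column is really numeric with a few stray strings, `pd.to_numeric(df[col], errors="coerce")` turns the bad values into NaN instead, which is closer to the coerce-to-null behavior proposed in ARROW-2098.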