I am trying to export a dataframe that contains, among others, categorical and nullable integer columns, in a format that can be easily read by R.
I put my bets on Apache Feather, but unfortunately pandas' Int64
datatype does not seem to be implemented:
from pyarrow import feather
import pandas as pd
col1 = pd.Series([0, None, 1, 23]).astype('Int64')
col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
df = pd.DataFrame({'a': col1, 'b': col2})
feather.write_feather(df, '/tmp/foo')
This is the error message one gets:
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-107-8cc611a30355> in <module>
----> 1 feather.write_feather(df, '/tmp/foo')
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write_feather(df, dest)
181 writer = FeatherWriter(dest)
182 try:
--> 183 writer.write(df)
184 except Exception:
185 # Try to make sure the resource is closed
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write(self, df)
92 # TODO(wesm): Remove this length check, see ARROW-1732
93 if len(df.columns) > 0:
---> 94 table = Table.from_pandas(df, preserve_index=False)
95 for i, name in enumerate(table.schema.names):
96 col = table[i]
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
551 if nthreads == 1:
552 arrays = [convert_column(c, f)
--> 553 for c, f in zip(columns_to_convert, convert_fields)]
554 else:
555 from concurrent import futures
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
551 if nthreads == 1:
552 arrays = [convert_column(c, f)
--> 553 for c, f in zip(columns_to_convert, convert_fields)]
554 else:
555 from concurrent import futures
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
542 e.args += ("Conversion failed for column {0!s} with type {1!s}"
543 .format(col.name, col.dtype),)
--> 544 raise e
545 if not field_nullable and result.null_count > 0:
546 raise ValueError("Field {} was non-nullable but pandas column "
~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
536
537 try:
--> 538 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
539 except (pa.ArrowInvalid,
540 pa.ArrowNotImplementedError,
ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column a with type Int64')
Is there a workaround that allows me to use this special Int64
datatype, preferably using pyarrow?
With the latest Arrow release (pyarrow 0.15.0) and the pandas development version, this is now supported:
In [1]: from pyarrow import feather
...: import pandas as pd
...:
...: col1 = pd.Series([0, None, 1, 23]).astype('Int64')
...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64')
...:
...: df = pd.DataFrame({'a': col1, 'b': col2})
...:
...: feather.write_feather(df, '/tmp/foo')
In [2]: feather.read_table('/tmp/foo')
Out[2]:
pyarrow.Table
a: int64
b: int64
You can see that the resulting Arrow table (when read back in) properly has integer columns. To have this in a released version, you will need to wait for pandas 1.0.
For now (without using pandas master), you have two workaround options:
Convert the column to an object dtype column (df['a'] = df['a'].astype(object)), and then write to feather. For such object columns (containing integers and missing values), pyarrow will correctly infer that they are integers.
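A minimal sketch of that first workaround, using the column values from the question (how the missing value is represented after the cast — None, NaN or NA — depends on your pandas version, but pd.isna handles all of them):

```python
import pandas as pd

# Nullable integer column as in the question
col = pd.Series([0, None, 1, 23]).astype('Int64')

# Workaround 1: cast to object dtype before writing to feather
obj_col = col.astype(object)

print(obj_col.dtype)        # object
print(pd.isna(obj_col[1]))  # True: the missing value is preserved
```

A DataFrame built from such object columns can then be passed to feather.write_feather as in the question.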
Monkeypatch pandas for now (until the next pandas release):
import pyarrow
pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type)
With that, writing nullable integer columns with pyarrow / feather should work out of the box (you still need pyarrow 0.15.0 for this).
Note that reading the feather file back into a pandas DataFrame will, for now, still result in a float column (if there are missing values), as that is the default conversion of Arrow integers to pandas. There is ongoing work to also preserve those pandas-specific types when converting to pandas (see https://issues.apache.org/jira/browse/ARROW-2428).