Tags: python, pandas, pyarrow, feather

Exporting a dataframe with nullable Int64 from pandas to R


I am trying to export a dataframe that contains, among others, categorical and nullable integer columns, so that it can be easily read by R.

I put my bets on Apache Feather, but unfortunately the Int64 dtype from pandas does not seem to be supported:

from pyarrow import feather
import pandas as pd

col1 = pd.Series([0, None, 1, 23]).astype('Int64')
col2 = pd.Series([1, 3, 2, 1]).astype('Int64')

df = pd.DataFrame({'a': col1, 'b': col2})

feather.write_feather(df, '/tmp/foo')

This is the error message one gets:

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-107-8cc611a30355> in <module>
----> 1 feather.write_feather(df, '/tmp/foo')

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write_feather(df, dest)
    181     writer = FeatherWriter(dest)
    182     try:
--> 183         writer.write(df)
    184     except Exception:
    185         # Try to make sure the resource is closed

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/feather.py in write(self, df)
     92         # TODO(wesm): Remove this length check, see ARROW-1732
     93         if len(df.columns) > 0:
---> 94             table = Table.from_pandas(df, preserve_index=False)
     95             for i, name in enumerate(table.schema.names):
     96                 col = table[i]

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    551     if nthreads == 1:
    552         arrays = [convert_column(c, f)
--> 553                   for c, f in zip(columns_to_convert, convert_fields)]
    554     else:
    555         from concurrent import futures

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    551     if nthreads == 1:
    552         arrays = [convert_column(c, f)
--> 553                   for c, f in zip(columns_to_convert, convert_fields)]
    554     else:
    555         from concurrent import futures

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    542             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    543                        .format(col.name, col.dtype),)
--> 544             raise e
    545         if not field_nullable and result.null_count > 0:
    546             raise ValueError("Field {} was non-nullable but pandas column "

~/miniconda3/envs/sci36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
    536 
    537         try:
--> 538             result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    539         except (pa.ArrowInvalid,
    540                 pa.ArrowNotImplementedError,

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column a with type Int64')

Is there a workaround that allows me to use this special Int64 datatype, preferably using pyarrow?


Solution

  • With the latest Arrow release (pyarrow 0.15.0) and the pandas development version, this is now supported:

    In [1]: from pyarrow import feather 
       ...: import pandas as pd 
       ...:  
       ...: col1 = pd.Series([0, None, 1, 23]).astype('Int64') 
       ...: col2 = pd.Series([1, 3, 2, 1]).astype('Int64') 
       ...:  
       ...: df = pd.DataFrame({'a': col1, 'b': col2}) 
       ...:  
       ...: feather.write_feather(df, '/tmp/foo') 
    
    In [2]: feather.read_table('/tmp/foo')
    Out[2]: 
    pyarrow.Table
    a: int64
    b: int64
    

    You can see that the resulting Arrow table (when read back in) properly has integer columns. To have this in a released version, you will need to wait for pandas 1.0.

    For now (without using pandas master), you have two workaround options:

    • Convert the column to an object dtype column (df['a'] = df['a'].astype(object)), and then write to feather. For such object columns (containing integers and missing values), pyarrow will correctly infer that they are integers.

    • Monkeypatch pandas for now (until the next pandas release):

      import pyarrow
      pd.arrays.IntegerArray.__arrow_array__ = lambda self, type: pyarrow.array(self._data, mask=self._mask, type=type)
      

      With that, writing nullable integer columns with pyarrow / feather should work out of the box (you still need the latest pyarrow 0.15.0 for this).
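    The first workaround (casting to object dtype) can be sketched as follows; the file path is just an illustrative example:

    ```python
    import pandas as pd
    from pyarrow import feather

    col1 = pd.Series([0, None, 1, 23]).astype('Int64')
    df = pd.DataFrame({'a': col1})

    # Cast the nullable integer column to object dtype: values become plain
    # Python ints, with a missing-value marker that pyarrow turns into a null.
    df['a'] = df['a'].astype(object)

    # This write no longer raises ArrowTypeError.
    feather.write_feather(df, '/tmp/foo_object')
    ```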


    Note that reading the feather file back into a pandas DataFrame will, for now, still result in a float column (if there are missing values), since that is the default conversion of Arrow integers to pandas. There is work going on to also preserve those specific pandas types when converting to pandas (see https://issues.apache.org/jira/browse/ARROW-2428).
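    Until that lands, a stopgap is to cast the float column back to the nullable Int64 dtype yourself after reading. A minimal sketch (the path and column name are just examples):

    ```python
    import pandas as pd
    from pyarrow import feather

    # Write a column with a missing value (object dtype avoids the Int64 issue).
    df = pd.DataFrame({'a': pd.Series([0, None, 1, 23], dtype=object)})
    feather.write_feather(df, '/tmp/foo_roundtrip')

    back = feather.read_feather('/tmp/foo_roundtrip')
    # The missing value comes back as NaN in a float column;
    # casting restores the nullable integer dtype.
    back['a'] = back['a'].astype('Int64')
    ```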