python pandas pyarrow orc

Write ORC using pandas when all values in a column are None


I want to write a simple dataframe as an ORC file. Its only column is of integer type. If I set all values to None, to_orc raises an exception.

I understand that pyarrow cannot infer the datatype from None values, but how can I fix the datatype for the output? Attempts to use .astype() only produced TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Bonus points if the solution also works for

  1. empty dataframes
  2. nested types

Script:

import pandas as pd

data = {'a': [1, 2]}
df = pd.DataFrame(data=data)
print(df)
df.to_orc('a.orc')  # OK
df['a'] = None  # column becomes object dtype holding only None
print(df)
df.to_orc('a.orc')  # fails

Output:

   a
0  1
1  2
      a
0  None
1  None
Traceback (most recent call last):
  File ... line 9, in <module>
  ...
  File "pyarrow/_orc.pyx", line 443, in pyarrow._orc.ORCWriter.write
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unknown or unsupported Arrow type: null

Solution

  • This is a known issue, see https://github.com/apache/arrow/issues/30317. The problem is that the ORC writer does not yet support writing an all-null column that has no specific dtype. Such a column has the generic object dtype in pandas, which pyarrow converts to the Arrow null type named in the error message. If you first cast the column to a concrete dtype, for example float, writing works.

    Using the df from your example:

    >>> df.dtypes
    a    object
    dtype: object
    
    # the column has generic object dtype, cast to float
    >>> df['a'] = df['a'].astype("float64")
    >>> df.dtypes
    a    float64
    dtype: object
    
    # now writing to ORC and reading back works
    >>> df.to_orc('a.orc')
    >>> pd.read_orc('a.orc')
        a
    0 NaN
    1 NaN