
Why can't we convert flat columns of awkward1 arrays `to_parquet`?


A follow-up to this question: "Best way to save a dict of awkward1 arrays?"

Saving multiple columns of nested awkward1 arrays (with varying lengths) works like this:

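import numpy as np
import awkward1 as ak

# Each column is a nested (variable-length) list.
dog = ak.from_iter([[1, 2], [5]])
cat = ak.from_iter([[4]])
# np.newaxis adds a length-1 outer dimension, so both columns have one entry.
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)

ak.to_parquet(pets, "pets.parquet")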

Unfortunately, this doesn't seem to work for flat lists:

import numpy as np
import awkward1 as ak

dog = ak.from_iter([1, 2, 5])
cat = ak.from_iter([4])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)

ak.to_parquet(pets, "pets.parquet")

This raises the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-7f3a7fefb261> in <module>
      3 cat = ak.from_iter([3])
      4 pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
----> 5 ak.to_parquet(pets, "pets.parquet")

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in to_parquet(array, where, explode_records, list_to32, string_to32, bytestring_to32, **options)
   2983     layout = to_layout(array, allow_record=False, allow_other=False)
   2984     iterator = batch_iterator(layout)
-> 2985     first = next(iterator)
   2986
   2987     if "schema" not in options:

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in batch_iterator(layout)
   2978                 )
   2979             yield pyarrow.RecordBatch.from_arrays(
-> 2980                 pa_arrays, schema=pyarrow.schema(pa_fields)
   2981             )
   2982

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays()

TypeError: object of type 'pyarrow.lib.Tensor' has no len()

What is the reason for encountering this error?


Solution

  • What you found is a bug, and now it is fixed: https://github.com/scikit-hep/awkward-1.0/pull/799

    What's happening here is that pyarrow can't write pyarrow.lib.Tensor (regular-length lists, such as the one you created with np.newaxis) to Parquet files. Parquet files don't have a concept of "regular-length list," so that makes sense. But rather than converting it, pyarrow hits an unhandled case, in which it fails to find the length of that pyarrow.lib.Tensor. (It's a little odd that pyarrow.lib.Tensor doesn't have a __len__ method, but that's another thing.)

    Anyway, with version 1.2.0 of Awkward Array, we'll simply convert regular-length lists into (in principle) variable-length lists when writing to Parquet, since the format doesn't have that type. According to the schedule, version 1.2.0 will be released tomorrow. (This bug-fix is likely the last prerelease.)