python pandas dataframe pyarrow awkward-array

to_parquet giving error: __arrow_array__() got an unexpected keyword argument 'type'

I'm reading a ROOT file using uproot and converting parts of it into a DataFrame using the arrays method.
This works fine until I try to save to Parquet using the DataFrame's to_parquet method. Sample code is given below.

import numpy as np
import pandas as pd
import uproot

# The next three lines rename the columns and choose what data to keep
data = pd.read_csv(dictFile, header=None, delim_whitespace=True)
dataFile, dataKey = data[0], data[1]
content_ele = {dataKey[i]: dataFile[i] for i in np.arange(len(dataKey))}

# We run over the different files to save a simplified version of them.
file_list = pd.read_csv(file_list_loc, names=["Loc"])

for file_loc in file_list.Loc:

    tree = uproot.open(f"{file_path}/{file_loc}:CollectionTree")

    arrays = tree.arrays(dataKey, library="pd").rename(columns=content_ele)

    save_loc = f"{save_path}/{file_loc[:-6]}reduced.parquet"
    arrays.to_parquet(path=save_loc)

Doing so results in the following error: __arrow_array__() got an unexpected keyword argument 'type'
It seems to originate from pa.array, if that helps.

Of note, the simplest DataFrame I've had this error with has 2 columns of awkward arrays (awkward.highlevel.Array). The array length differs from row to row but is the same in both columns of a given row. An example is given below.

           A                      B
0   [31, 26, 17, 23]    [-2.1, 1.3, 0.5, -0.4]
1   [75, 15, 49]        [2.4, -1.8, 0.8] 
2   [58, 45, 64, 47]    [-1.9, -0.4, -2.5, 1.3]
3   [26]                [-1.1]

I've tried both reducing which elements I run on (for example, only integers) and reducing the number of columns, as above, and the error persists.
However, running this exact same method with to_json gives no errors. The problem with that method is that once I read the file back, what were previously awkward arrays are now plain lists, which makes them much less practical to work with whenever I want to calculate something like array.A/2. Yes, I could just convert them, but it seems wiser to keep the original format, and it is easier since I wouldn't have to convert each time.

Solution

Solution: Upgrade your awkward-pandas package. When I first tried to reproduce your problem with awkward-pandas version 2022.12a1, I saw the same error, then I upgraded to 2023.8.0 and it's gone.
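
Before upgrading, it can help to confirm which version is actually installed in the environment that runs your script. Here is a minimal sketch using only the standard library. The helper name installed_version is made up for illustration, and I'm assuming the distribution is registered under the name "awkward-pandas":

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent.

    Hypothetical helper for illustration; not part of any library.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is not installed.
print(installed_version("awkward-pandas"))
```

If this reports an old version, something like pip install --upgrade awkward-pandas (or the equivalent in your package manager) should pull in the fix.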

Detective work: I'm writing all of this down because I'm so proud of myself. :)

I'm guessing that the data in f"{file_path}/{file_loc}:CollectionTree" is ragged. There's no indication of this in your example, but if it were purely numerical data types (no variable-length lists or nested data structures), then the arrays would be a normal Pandas DataFrame. If, in that case, you got an error, it would be a Pandas error—possible, but less likely because someone else would have noticed it first.

So assuming that arrays is a DataFrame of ragged data (and this is Uproot >= 5.0), the data types in each column are managed with awkward-pandas. If so, I should be able to reproduce the error like this:

>>> import awkward as ak
>>> import pandas as pd
>>> import awkward_pandas
>>> ragged_array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> ak_ext_array = awkward_pandas.AwkwardExtensionArray(ragged_array)
>>> df = pd.DataFrame({"column": ak_ext_array})
>>> df
         column
0     [0, 1, 2]
1            []
2        [3, 4]
3           [5]
4  [6, 7, 8, 9]
>>> df.to_parquet("/tmp/file.parquet")

and I do (with awkward-pandas version 2022.12a1):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/core/frame.py", line 2889, in to_parquet
    return to_parquet(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 411, in to_parquet
    impl.write(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 159, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3480, in pyarrow.lib.Table.from_pandas
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 590, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 263, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
TypeError: __arrow_array__() got an unexpected keyword argument 'type'

(For the future: including a whole stack trace would remove a lot of guesswork.)

I first thought, "Maybe awkward-pandas hasn't implemented the __arrow_array__ protocol." But no, the AwkwardExtensionArray has an __arrow_array__ method:

>>> ak_ext_array.__arrow_array__()
<pyarrow.lib.ChunkedArray object at 0x7ff422d0b9f0>
[
  [
    [
      0,
      1,
      2
    ],
    [],
    ...
    [
      5
    ],
    [
      6,
      7,
      8,
      9
    ]
  ]
]

Then, "Maybe it has an __arrow_array__ method, but that method doesn't take a type argument," which is what the error message is saying.

>>> help(ak_ext_array.__arrow_array__)
Help on method __arrow_array__ in module awkward_pandas.array:
__arrow_array__() method of awkward_pandas.array.AwkwardExtensionArray instance

Aha! That's it! So I was about to write an issue on awkward-pandas, and in so doing, point out the function definition that's missing a type argument. But the function definition isn't missing a type argument.

https://github.com/intake/awkward-pandas/blob/1f8cf19fdc9cb0786642f39cfaf7c084c3c5c9bc/src/awkward_pandas/array.py#L148-L151

It's just that my copy of the package was old. This is an old bug that has since been fixed.
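
To make the failure mode concrete, here is a sketch with two hypothetical stand-in classes (OldStyle and NewStyle are made up and are not taken from awkward-pandas or pyarrow). The pa.array(col, type=type_, ...) frame in the traceback passes a type along, so a __arrow_array__ method defined without that parameter fails in the same way. The helper accepts_type is also made up and only inspects the signature:

```python
import inspect


class OldStyle:
    # Hypothetical: mimics a method defined without a `type` parameter.
    def __arrow_array__(self):
        return "converted"


class NewStyle:
    # Hypothetical: mimics a method that accepts an optional `type` parameter.
    def __arrow_array__(self, type=None):
        return "converted"


def accepts_type(obj):
    """Report whether obj.__arrow_array__ can be called with a `type` keyword."""
    params = inspect.signature(obj.__arrow_array__).parameters
    return "type" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )


print(accepts_type(OldStyle()))  # False
print(accepts_type(NewStyle()))  # True

try:
    # What a caller that passes `type` effectively does.
    OldStyle().__arrow_array__(type=None)
except TypeError as err:
    print(err)  # message ends with: got an unexpected keyword argument 'type'
```

Running accepts_type on one of your own column's extension arrays (for example df["column"].array) is a quick way to tell whether the installed package still has the old signature.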

I upgraded my awkward-pandas and it all works now:

>>> df.to_parquet("/tmp/file.parquet")

(no errors)

>>> ak.from_parquet("/tmp/file.parquet").show()
[{column: [0, 1, 2]},
 {column: []},
 {column: [3, 4]},
 {column: [5]},
 {column: [6, 7, 8, 9]}]

(reads back appropriately)