Tags: python, pandas, dataframe, pyarrow, awkward-array

to_parquet gives error: __arrow_array__() got an unexpected keyword argument 'type'


I'm reading a ROOT file with uproot and converting parts of it into a DataFrame using the arrays method.
This works fine until I try to save the DataFrame to Parquet with its to_parquet method. Sample code is given below.

import pandas as pd
import uproot

# dictFile, file_list_loc, file_path and save_path are defined elsewhere.

# The next three lines build the mapping used to rename the columns and choose what data to keep
data = pd.read_csv(dictFile, header=None, sep=r"\s+")
dataFile, dataKey = data[0], data[1]
content_ele = dict(zip(dataKey, dataFile))

# We run over the different files to save a simplified version of them.
file_list = pd.read_csv(file_list_loc, names=["Loc"])

for file_loc in file_list.Loc:

    tree = uproot.open(f"{file_path}/{file_loc}:CollectionTree")

    arrays = tree.arrays(dataKey, library="pd").rename(columns=content_ele)

    # file_loc[:-6] drops the last six characters of the file name
    save_loc = f"{save_path}/{file_loc[:-6]}reduced.parquet"
    arrays.to_parquet(path=save_loc)

Doing so results in the following error: __arrow_array__() got an unexpected keyword argument 'type'
It seems to originate from pa.array, if that helps.

Of note, the simplest DataFrame I've had this error with has two columns, each holding an awkward array (awkward.highlevel.Array) in every row. The array length varies from row to row but is the same across both columns within a row. An example is given below.

           A                      B
0   [31, 26, 17, 23]    [-2.1, 1.3, 0.5, -0.4]
1   [75, 15, 49]        [2.4, -1.8, 0.8] 
2   [58, 45, 64, 47]    [-1.9, -0.4, -2.5, 1.3]
3   [26]                [-1.1] 

I've tried reducing which elements I run on (for example, only integers) and reducing the number of columns, as above, but the error persists.
However, running the exact same code with to_json gives no errors. The problem with that approach is that when I read the file back, what were previously awkward arrays are now plain lists, which makes them much less practical to work with whenever I want to calculate something like array.A/2. Yes, I could convert them, but it seems wiser to keep the original format, and it is easier since I wouldn't have to convert every time.


Solution

  • Solution: Upgrade your awkward-pandas package. When I first tried to reproduce your problem with awkward-pandas version 2022.12a1, I saw the same error; after I upgraded to 2023.8.0, it was gone.
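
    To check which version is installed before upgrading, something like the following may help. It is a minimal sketch using only the standard library. It assumes the distribution is registered under the name "awkward-pandas", treats 2023.8.0 as known-good only because that is the version that worked above (the fix may have landed earlier), and compares versions crudely by their leading numeric parts, ignoring pre-release markers such as "a1". The helper names are made up for illustration.

```python
import re
from importlib import metadata

# Assumption: the distribution is registered under this name.
PACKAGE = "awkward-pandas"
# The version reported to work above; earlier releases may also contain the fix.
KNOWN_GOOD = "2023.8.0"


def release_tuple(version):
    """Leading numeric components of a version string, e.g. '2022.12a1' -> (2022, 12)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def installed_version(package=PACKAGE):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def needs_upgrade(version, known_good=KNOWN_GOOD):
    """True if `version` sorts before the known-good release (crude comparison)."""
    return release_tuple(version) < release_tuple(known_good)


found = installed_version()
if found is None:
    print(f"{PACKAGE} is not installed")
elif needs_upgrade(found):
    # Typical upgrade command: python -m pip install --upgrade awkward-pandas
    print(f"{PACKAGE} {found} predates {KNOWN_GOOD}; consider upgrading")
else:
    print(f"{PACKAGE} {found} is at or past {KNOWN_GOOD}")
```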

    Detective work: I'm writing all of this down because I'm so proud of myself. :)

    I'm guessing that the data in f"{file_path}/{file_loc}:CollectionTree" is ragged. There's no indication of this in your example, but if it were purely numerical data types (no variable-length lists or nested data structures), then the arrays would be a normal Pandas DataFrame. If, in that case, you got an error, it would be a Pandas error—possible, but less likely because someone else would have noticed it first.

    So assuming that arrays is a DataFrame of ragged data (and this is Uproot >= 5.0), the data types in each column are managed with awkward-pandas. If so, I should be able to reproduce the error like this:

    >>> import awkward as ak
    >>> import pandas as pd
    >>> import awkward_pandas
    >>> ragged_array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
    >>> ak_ext_array = awkward_pandas.AwkwardExtensionArray(ragged_array)
    >>> df = pd.DataFrame({"column": ak_ext_array})
    >>> df
             column
    0     [0, 1, 2]
    1            []
    2        [3, 4]
    3           [5]
    4  [6, 7, 8, 9]
    >>> df.to_parquet("/tmp/file.parquet")
    

    and I do (with awkward-pandas version 2022.12a1):

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/core/frame.py", line 2889, in to_parquet
        return to_parquet(
      File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 411, in to_parquet
        impl.write(
      File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 159, in write
        table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
      File "pyarrow/table.pxi", line 3480, in pyarrow.lib.Table.from_pandas
      File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in dataframe_to_arrays
        arrays = [convert_column(c, f)
      File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in <listcomp>
        arrays = [convert_column(c, f)
      File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 590, in convert_column
        result = pa.array(col, type=type_, from_pandas=True, safe=safe)
      File "pyarrow/array.pxi", line 263, in pyarrow.lib.array
      File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
    TypeError: __arrow_array__() got an unexpected keyword argument 'type'
    

    (For the future: including a whole stack trace would remove a lot of guesswork.)
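
    One possible way to capture the whole stack trace inside a loop like the one in the question is sketched below. It uses only the standard library; run_and_capture and failing_write are hypothetical names, and failing_write merely stands in for the real arrays.to_parquet(path=save_loc) call.

```python
import traceback


def run_and_capture(func, *args, **kwargs):
    """Call func; return (result, None) on success or (None, traceback text) on failure."""
    try:
        return func(*args, **kwargs), None
    except Exception:
        # format_exc() returns the same text Python would print for an uncaught error
        return None, traceback.format_exc()


def failing_write(path):
    # Hypothetical stand-in for arrays.to_parquet(path=save_loc)
    raise TypeError("__arrow_array__() got an unexpected keyword argument 'type'")


_, trace = run_and_capture(failing_write, "example.parquet")
print(trace)  # the full traceback, ready to paste into a question or bug report
```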

    I first thought, "Maybe awkward-pandas hasn't implemented the __arrow_array__ protocol." But no, the AwkwardExtensionArray has an __arrow_array__ method:

    >>> ak_ext_array.__arrow_array__()
    <pyarrow.lib.ChunkedArray object at 0x7ff422d0b9f0>
    [
      [
        [
          0,
          1,
          2
        ],
        [],
        ...
        [
          5
        ],
        [
          6,
          7,
          8,
          9
        ]
      ]
    ]
    

    Then, "Maybe it has an __arrow_array__ method, but that method doesn't take a type argument," which is what the error message is saying.

    >>> help(ak_ext_array.__arrow_array__)
    Help on method __arrow_array__ in module awkward_pandas.array:
    __arrow_array__() method of awkward_pandas.array.AwkwardExtensionArray instance
    

    Aha! That's it! So I was about to write an issue on awkward-pandas, and in so doing, point out the function definition that's missing a type argument. But the function definition isn't missing a type argument.

    https://github.com/intake/awkward-pandas/blob/1f8cf19fdc9cb0786642f39cfaf7c084c3c5c9bc/src/awkward_pandas/array.py#L148-L151

    It's just that my copy of the package was old. This is an old bug that has since been fixed.
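
    The mismatch can be reproduced without pyarrow or awkward-pandas. In the sketch below, OldStyleArray and FixedArray are hypothetical stand-ins (not the real AwkwardExtensionArray), and call_like_pyarrow only imitates the keyword call visible in the traceback above; it is not pyarrow's actual code.

```python
import inspect


class OldStyleArray:
    """Stand-in for the old behavior: __arrow_array__ takes no arguments."""

    def __arrow_array__(self):
        return "arrow data"


class FixedArray:
    """Stand-in for the fixed behavior: __arrow_array__ accepts an optional type."""

    def __arrow_array__(self, type=None):
        return "arrow data"


def accepts_type_keyword(obj):
    """True if obj.__arrow_array__ can be called with a `type` keyword."""
    params = inspect.signature(obj.__arrow_array__).parameters
    has_var_keyword = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return "type" in params or has_var_keyword


def call_like_pyarrow(obj, type_=None):
    # Imitates the keyword call that fails in the traceback above
    return obj.__arrow_array__(type=type_)


print(accepts_type_keyword(OldStyleArray()))  # False
print(accepts_type_keyword(FixedArray()))     # True

try:
    call_like_pyarrow(OldStyleArray())
except TypeError as err:
    print(err)  # message ends with: got an unexpected keyword argument 'type'
```

    The same accepts_type_keyword check could be pointed at a real extension array (ak_ext_array in the session above) to see which behavior an installed version has.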

    I upgraded my awkward-pandas and it all works now:

    >>> df.to_parquet("/tmp/file.parquet")
    

    (no errors)

    >>> ak.from_parquet("/tmp/file.parquet").show()
    [{column: [0, 1, 2]},
     {column: []},
     {column: [3, 4]},
     {column: [5]},
     {column: [6, 7, 8, 9]}]
    

    (reads back appropriately)