I'm reading a ROOT file using uproot and converting parts of it into a DataFrame using the arrays method.
This works fine until I try to save to Parquet using the to_parquet method on the DataFrame. Sample code is given below.
import numpy as np
import pandas as pd
import uproot

# First three lines are here to rename the columns and choose what data to keep
data = pd.read_csv(dictFile, header = None, delim_whitespace=True)
dataFile, dataKey = data[0], data[1]
content_ele = {dataKey[i]: dataFile[i] for i in np.arange(len(dataKey))}
# We run over the different files to save a simplified version of them.
file_list = pd.read_csv(file_list_loc, names=["Loc"])
for file_loc in file_list.Loc:
    tree = uproot.open(f"{file_path}/{file_loc}:CollectionTree")
    arrays = tree.arrays(dataKey, library="pd").rename(columns=content_ele)
    save_loc = f"{save_path}/{file_loc[:-6]}reduced.parquet"
    arrays.to_parquet(path=save_loc)
Doing so results in the following error: __arrow_array__() got an unexpected keyword argument 'type'
It seems to originate from pa.array, if that helps out.
Of note, the simplest DataFrame I've had this error with has 2 columns, each holding awkward arrays (awkward.highlevel.Array). The arrays vary in length from row to row, but within a row both columns have the same length. An example is given below.
   A                 B
0  [31, 26, 17, 23]  [-2.1, 1.3, 0.5, -0.4]
1  [75, 15, 49]      [2.4, -1.8, 0.8]
2  [58, 45, 64, 47]  [-1.9, -0.4, -2.5, 1.3]
3  [26]              [-1.1]
I've tried reducing both which elements I run on (for example, only integers) and the number of columns, as above.
However, running this exact same method with to_json gives no errors. The problem with that method is that once I read the file back, what were previously awkward arrays are now just lists, making it much more impractical to work with whenever I want to calculate something like array.A/2. Yes, I could just convert it, but it seems wiser to keep the original format, and it is easier since I don't have to do the conversion each time.
Solution: Upgrade your awkward-pandas package. When I first tried to reproduce your problem with awkward-pandas version 2022.12a1, I saw the same error; then I upgraded to 2023.8.0 and it was gone.
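To see which side of that upgrade an environment is on, a minimal sketch along these lines may help. It uses only the standard library; the helper names are hypothetical, and the threshold is simply the version that worked for me (the first release containing the fix may be earlier), so treat it as an assumption rather than an official minimum. The comparison looks only at the leading year and month of the version string and ignores pre-release suffixes such as "a1".

```python
import re
from importlib import metadata

# Version that worked in this answer. The first release with the fix may be
# earlier, so this threshold is an assumption, not an official minimum.
KNOWN_GOOD = "2023.8.0"


def year_month(version):
    """Return the leading (year, month) pair of a calendar-style version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_older_than_known_good(version, known_good=KNOWN_GOOD):
    """Crude comparison on (year, month) only; pre-release suffixes are ignored."""
    return year_month(version) < year_month(known_good)


def installed_version(package="awkward-pandas"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("awkward-pandas is not installed in this environment")
elif is_older_than_known_good(found):
    print(f"awkward-pandas {found} is older than {KNOWN_GOOD}; consider upgrading")
else:
    print(f"awkward-pandas {found} is at least as new as {KNOWN_GOOD}")
```

If the installed version turns out to be older, something like pip install --upgrade awkward-pandas (or the equivalent in your package manager) should bring it up to date.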
Detective work: I'm writing all of this down because I'm so proud of myself. :)
I'm guessing that the data in f"{file_path}/{file_loc}:CollectionTree" is ragged. There's no indication of this in your example, but if it were purely numerical data types (no variable-length lists or nested data structures), then the arrays would be a normal Pandas DataFrame. If, in that case, you got an error, it would be a Pandas error, which is possible but less likely because someone else would have noticed it first.
So assuming that arrays is a DataFrame of ragged data (and this is Uproot >= 5.0), the data types in each column are managed with awkward-pandas. If so, I should be able to reproduce the error like this:
>>> import awkward as ak
>>> import pandas as pd
>>> import awkward_pandas
>>> ragged_array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> ak_ext_array = awkward_pandas.AwkwardExtensionArray(ragged_array)
>>> df = pd.DataFrame({"column": ak_ext_array})
>>> df
column
0 [0, 1, 2]
1 []
2 [3, 4]
3 [5]
4 [6, 7, 8, 9]
>>> df.to_parquet("/tmp/file.parquet")
and I do (with awkward-pandas version 2022.12a1):
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/core/frame.py", line 2889, in to_parquet
    return to_parquet(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 411, in to_parquet
    impl.write(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 159, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3480, in pyarrow.lib.Table.from_pandas
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 590, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 263, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
TypeError: __arrow_array__() got an unexpected keyword argument 'type'
(For the future: including a whole stack trace would remove a lot of guesswork.)
I first thought, "Maybe awkward-pandas hasn't implemented the __arrow_array__ protocol." But no, the AwkwardExtensionArray has an __arrow_array__ method:
>>> ak_ext_array.__arrow_array__()
<pyarrow.lib.ChunkedArray object at 0x7ff422d0b9f0>
[
[
[
0,
1,
2
],
[],
...
[
5
],
[
6,
7,
8,
9
]
]
]
Then, "Maybe it has an __arrow_array__ method, but that method doesn't take a type argument," which is what the error message is saying.
>>> help(ak_ext_array.__arrow_array__)
Help on method __arrow_array__ in module awkward_pandas.array:
__arrow_array__() method of awkward_pandas.array.AwkwardExtensionArray instance
Aha! That's it! So I was about to write an issue on awkward-pandas and, in so doing, point out the function definition that's missing a type argument. But the function definition isn't missing a type argument.
It's just that my copy of the package was old. This is an old bug that has since been fixed.
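The signature mismatch behind this error can be illustrated without pyarrow or awkward-pandas at all. The sketch below uses two hypothetical stand-in classes (they are not the real AwkwardExtensionArray): one whose __arrow_array__ takes no arguments, like my old copy, and one that accepts an optional type keyword. It assumes, as the traceback and error message suggest, that the caller always passes type= as a keyword. The helper names are made up for illustration.

```python
import inspect


class OldStyleArray:
    """Hypothetical stand-in for an extension array with the outdated signature."""

    def __arrow_array__(self):
        return "converted"


class NewStyleArray:
    """Hypothetical stand-in whose method accepts the optional `type` keyword."""

    def __arrow_array__(self, type=None):
        return "converted"


def accepts_type_keyword(obj):
    """Report whether obj.__arrow_array__ can be called with type=..."""
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    params = inspect.signature(obj.__arrow_array__).parameters.values()
    return any(
        p.kind is inspect.Parameter.VAR_KEYWORD
        or (p.name == "type" and p.kind in keyword_kinds)
        for p in params
    )


def call_like_pyarrow(obj):
    """Mimic the call shape implied by the error: `type` is passed by keyword."""
    return obj.__arrow_array__(type=None)


print(accepts_type_keyword(OldStyleArray()))  # the outdated signature
print(accepts_type_keyword(NewStyleArray()))  # the signature with `type`

try:
    call_like_pyarrow(OldStyleArray())
except TypeError as err:
    print(f"TypeError: {err}")
```

Running accepts_type_keyword on a real extension array (for example, the ak_ext_array above) would be a quick way to tell whether the installed package has the newer signature, without reading the help output by eye.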
I upgraded my awkward-pandas and it all works now:
>>> df.to_parquet("/tmp/file.parquet")
(no errors)
>>> ak.from_parquet("/tmp/file.parquet").show()
[{column: [0, 1, 2]},
{column: []},
{column: [3, 4]},
{column: [5]},
{column: [6, 7, 8, 9]}]
(reads back appropriately)