pandas, numpy, pyarrow

Unexpected behaviour when calling `.values` on `bool[pyarrow]` typed `pandas.DataFrame`s


I have several bool-type DataFrames. Some of them are constructed like

df1 = pd.DataFrame(True, index=[0], columns=[0], dtype='bool[pyarrow]')

and others are

df2 = pd.DataFrame(True, index=[0], columns=[0], dtype='bool')

When I call .values I get different behaviour:

df1.values.dtype
> dtype('O')

df2.values.dtype
> dtype('bool')

I get similar behaviour for dtype='float[pyarrow]' vs dtype='float'. I just confirmed this is still true for pandas 2.1.4.

Is this expected behaviour, or something that will eventually be fixed as pandas and pyarrow become more tightly integrated? It seems odd that pandas does the work of constructing the NumPy array but then leaves it as object dtype, rather than filling in a more consistent and efficient native dtype.


Solution

  • I can't point to the exact change, but this is fixed in more recent releases: I no longer see the difference with pandas==2.2.2 and pyarrow==16.1.0.