
pyarrow.lib.ArrowCapacityError when creating string


I'd like to create a new string column, but pandas with the pyarrow backend throws an ArrowCapacityError: array cannot contain more than 2147483646 bytes, have 3525828799

    df["edge_id"] = df.knot.shift(1).fillna("start") + "-" + df.knot

My df is several GB in size. Is it too large for the pyarrow backend? The same operation works with the numpy backend. "knot" is a str / str[pyarrow] column, and the server has enough free RAM.


Solution

  • The issue you are running into is related to https://github.com/pandas-dev/pandas/issues/56259 and is, under the hood, caused by a limitation of the pyarrow string type: it can only hold around 2GB of data in a single chunk. Typically this is solved by splitting the array into multiple chunks under the hood, or by using the large_string type. However, there are several situations where pyarrow doesn't chunk automatically when it would be needed (manual chunking is possible in pyarrow, but that isn't really exposed in the pandas API).

    If you are using the new (pyarrow-backed) string dtype that will become the default in pandas 3.0 (enabled with pd.options.future.infer_string = True), then this will be fixed in the upcoming pandas 2.2 release.
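    A minimal sketch of that opt-in, using a tiny stand-in DataFrame (the real one would be several GB). With the option enabled on pandas >= 2.1, plain Python strings are inferred as the pyarrow-backed string dtype, and with pandas >= 2.2 the original concatenation should no longer hit the capacity error:

    ```python
    import pandas as pd

    # Opt in to the future default string dtype (requires pandas >= 2.1).
    pd.options.future.infer_string = True

    # Tiny stand-in frame; the real df would be several GB of strings.
    df = pd.DataFrame({"knot": ["a", "b", "c"]})

    # The operation from the question: with pandas >= 2.2 and
    # infer_string enabled, this no longer raises ArrowCapacityError
    # on large frames, because chunking is handled under the hood.
    df["edge_id"] = df.knot.shift(1).fillna("start") + "-" + df.knot
    print(df["edge_id"].tolist())
    ```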

    If you are using ArrowDtype("string") (e.g. from passing dtype_backend="pyarrow" to a read or constructor API), then one option is to cast your column to pd.ArrowDtype(pa.large_string()).
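    A minimal sketch of that cast, again with a tiny stand-in frame. large_string uses 64-bit offsets, so a single chunk is no longer capped at ~2GB of character data:

    ```python
    import pandas as pd
    import pyarrow as pa

    # Tiny stand-in frame; the real df would be several GB of strings.
    df = pd.DataFrame(
        {"knot": pd.array(["a", "b", "c"], dtype=pd.ArrowDtype(pa.string()))}
    )

    # Cast from string (32-bit offsets, ~2 GB per chunk) to
    # large_string (64-bit offsets, effectively unbounded).
    df["knot"] = df["knot"].astype(pd.ArrowDtype(pa.large_string()))

    # The operation from the question now stays within capacity.
    df["edge_id"] = df.knot.shift(1).fillna("start") + "-" + df.knot
    print(df["edge_id"].tolist())
    ```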