Search code examples
pandasapache-arrowfeathervaex

Arrow IPC vs Feather


What is the difference between Arrow IPC and Feather?

The official Arrow documentation has this to say about Feather:

Version 2 (V2), the default version, which is exactly represented as the Arrow IPC file format on disk. V2 files support storing all Arrow data types as well as compression with LZ4 or ZSTD. V2 was first made available in Apache Arrow 0.17.0.

While vaex, which is an alternative to pandas, has two different functions, one for Arrow IPC and one for Feather. polars, another pandas alternative, indicate that Arrow IPC and Feather are the same.


Solution

  • TL;DR There is no difference between the Arrow IPC file format and Feather V2.

    There's some confusion because of the two versions of Feather, and because of the Arrow IPC file format vs the Arrow IPC stream format.

    For the two versions of Feather, see the FAQ entry:

    What about the “Feather” file format?

    The Feather v1 format was a simplified custom container for writing a subset of the Arrow format to disk prior to the development of the Arrow IPC file format. “Feather version 2” is now exactly the Arrow IPC file format and we have retained the “Feather” name and APIs for backwards compatibility.

    So IPC == Feather(V2). Some places refer to Feather mean Feather(V1) which is different from the IPC file format. However, that doesn't seem to be the issue here: Polars and Vaex appear to use Feather to mean Feather(V2) (though Vaex slightly misleadingly says "Feather is exactly represented as the Arrow IPC file format on disk, but also support compression").

    Vaex exposes both export_arrow and export_feather. This relates to another point of Arrow, as it defines both an IPC stream format and an IPC file format. They differ in that the file format has a magic string (for file identification) and a footer (to support random access reads) (documentation).

    export_feather always writes the IPC file format (==FeatherV2), while export_arrow lets you choose between the IPC file format and the IPC stream format. Looking at where export_feather was added I think the confusion might stem from the PyArrow APIs making it obvious how to enable compression with the Feather API methods (which are a user-friendly convenience) but not with the IPC file writer (which is what export_arrow uses). But ultimately, the format being written is the same.