Tags: tensor, pyarrow, apache-arrow

What are the aims for the Apache Arrow Tensor Extension types?


Apache Arrow (Columnar Format) is a "standard and efficient in-memory representation of various datatypes".

DLPack is an "open in-memory tensor structure for sharing tensors among frameworks".

Both Arrow and DLPack aim to provide a common standard that can simplify the exchange of data between major frameworks (e.g. pandas and Polars, or TensorFlow and PyTorch). I previously viewed Arrow as offering a standard for tabular/dataframe-based libraries, and DLPack as offering a standard for array/tensor-based libraries.

But with the relatively recent release of the Arrow Fixed Shape Tensor extension type, it seems like Arrow and DLPack are now overlapping in a more substantial way.

Could anyone provide a bit more information about the longer-term vision of the Tensor extension types within the Arrow project, and how they are intended to overlap with other standards like DLPack? Is there not a risk of creating multiple competing standards here (like the famous XKCD cartoon)?


Solution

  • It is true that both Apache Arrow and DLPack are designed to simplify the exchange of data between major frameworks, but they differ considerably in scope, design, and use cases.

    Apache Arrow is, at its base, an in-memory columnar format built to be a standard for tabular data with broad adoption, where libraries can benefit from zero-copy reads and fast data access. See some good content on the topic in The Composable Codex.

    The fixed shape tensor extension type and the variable shape tensor extension type (the latter still in the process of being implemented) are simply existing Arrow data types with extra metadata (custom semantics such as shape, permutation, etc.) specific to applications that work with tensors. Applications can construct Arrow Tables/RecordBatches with columns of a specific extension type, for example columns whose values are tensors, and then make use of the custom semantics when working with Arrow data in their pipelines. In this case, they can work with and transfer the data together with information such as the shape of each tensor.

    DLPack is much smaller in scope and is focused on applications that work with tensors/arrays but use different in-memory formats (if two applications already share Apache Arrow data, no extra protocol is needed). DLPack was selected as the tensor interchange protocol of the Python array API standard by the Consortium for Python Data API Standards, and that is how I understand it: an array/tensor interchange protocol.
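To make the "interchange protocol" point concrete, here is a small sketch using only NumPy, which implements both sides of DLPack (producer via `__dlpack__`, consumer via `np.from_dlpack`, NumPy 1.22+):

```python
import numpy as np

# NumPy arrays implement __dlpack__, so NumPy can act as both the
# producer and the consumer of the protocol. In practice the producer
# and consumer would be different libraries (e.g. PyTorch and JAX).
src = np.arange(6, dtype=np.float32).reshape(2, 3)
dst = np.from_dlpack(src)

# The exchange is zero-copy: both arrays view the same memory.
print(np.shares_memory(src, dst))  # True
```

The same `np.from_dlpack` call accepts any object exposing `__dlpack__`, which is exactly what makes it a bridge between otherwise unrelated tensor libraries.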

    Another important thing to note is that Apache Arrow supports missing values via validity bitmaps, and it also supports nested and logical types (see Arrow Columnar Format - Terminology), which is an important distinction from DLPack or NumPy to keep in mind.

    There are also differences in device support, but I assume we are talking about CPU here.

    Just to make things more interesting, there is also a Dataframe Interchange Protocol for libraries that work with tabular data; see the Python dataframe API. This protocol is closer to Apache Arrow than DLPack is, but it is also "only" a protocol, not a comprehensive standard the way Arrow is. Again, I would point you to The Composable Codex for a nice overview of Apache Arrow.
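As a small illustration of that protocol, pandas (1.5+) implements both sides of it, the producer via `__dataframe__` and the consumer via `pandas.api.interchange.from_dataframe`:

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# __dataframe__ returns the interchange object; any library that
# implements the consumer side can rebuild the data from it.
xchg = df.__dataframe__()
print(xchg.num_columns())  # 2

# pandas acting as the consumer (normally the source would be
# another dataframe library, e.g. Polars or Vaex).
df2 = from_dataframe(df)
```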

    To wrap up, revisiting your questions:

    Both Arrow and DLPack aim to provide a common standard that can simplify the exchange of data between major frameworks

    Yes, but with different scope and use-cases.

    Arrow as offering a standard for Tabular/Dataframe-based libraries, and DLPack as offering a standard for Array/Tensor-based libraries.

    That is a rough but reasonable mental model; I hope the overview above has refined it a bit.

    But with the relatively recent release of the Arrow Fixed Shape Tensor extension type, it seems like Arrow and DLPack are now overlapping in a more substantial way.

    DLPack provides a bridge between tensor/ndarray libraries that do not share the same storage format. Canonical extension types in Apache Arrow can be used by libraries that already use the Arrow format and would benefit from extra metadata.

    Could anyone provide a bit more information about the longer-term vision of the Tensor extension types within the Arrow project,

    I am not sure there is a long-term vision as such. As noted in the previous point, the need for these canonical extension types surfaced because multiple libraries needed them. An extension type can be proposed as an Apache Arrow canonical extension so that interoperability between different systems already integrating Arrow columnar data can be improved. See https://arrow.apache.org/docs/format/CanonicalExtensions.html

    and how they are intended to overlap with other standards like DLPack?

    They are similar, but different. There is also ongoing work to implement the __dlpack__ method in PyArrow. This would enable libraries that do not support Apache Arrow to consume Arrow data via the __dlpack__ protocol =)

    Is there not a risk of creating multiple competing standards here (like the famous XKCD cartoon)?

    I do not think so. As mentioned, Apache Arrow is very comprehensive, but also technically challenging to implement and support. The protocols mentioned in this thread would become redundant if everybody adopted Apache Arrow :) At the same time, they let specific use cases take advantage of interoperability and simplify the exchange of data.

    Hope this helps. In case I have made any errors in my understanding or summary, I apologize. Adding some more links to the official documentation for clarification: