Apache Arrow Flight: Getting sorted data from multiple endpoints


According to the documentation (https://arrow.apache.org/docs/dev/format/Flight.html), an Apache Arrow Flight client cannot get sorted data from multiple endpoints. This seems to be by design.

The introductory blog post (https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) says: "While Flight streams are not necessarily ordered, we provide for application-defined metadata which can be used to serialize ordering information." But I don't think application-defined metadata is very useful here, because a general-purpose client (such as a BI application) that talks to Flight through a wrapper does not know about it. That applies to Apache Arrow Flight SQL, and even more so to a wrapper of a wrapper such as the Apache Arrow Flight SQL JDBC driver.

Is there any standard way to get sorted data from multiple Apache Arrow Flight endpoints? If not, why did the designers choose not to support that feature?

Thanks.


Solution

  • It was not considered at the time, but you are right: it would be useful to have a way to indicate this so that various wrappers and projects building on top have a standardized way to know how to handle this.

    The main idea is that if the data is sorted, the server should return a single endpoint. I believe the reasoning was that it would be rare for an implementation to be capable of sorting across multiple endpoints, since that would be expensive to implement. Of course, that isn't very useful if your backend can actually sort data across multiple workers!

    I (as one of the contributors to the project) am planning to put up a proposal to handle this case. If you are interested, please keep a watch on the mailing list: [email protected].
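
Until something like that exists, a client that happens to know (out of band) that each endpoint's stream is already sorted on the same key could merge the streams itself. The following is only a minimal sketch of that idea, not anything Flight provides: the row shape `(key, payload)`, the function name `merge_sorted_endpoints`, and the three in-memory lists standing in for per-endpoint streams are all hypothetical, and the per-endpoint ordering is an assumption that the protocol itself does not guarantee.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical row shape: (sort key, payload).
Row = Tuple[int, str]


def merge_sorted_endpoints(streams: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Lazily k-way merge streams that are each already sorted by row[0].

    Assumption: every stream is individually sorted on the same key.
    If any stream is not, the merged output will not be sorted either.
    """
    return heapq.merge(*streams, key=lambda row: row[0])


# In-memory stand-ins for the rows read from three endpoints. In a real
# client, each of these would come from reading one endpoint's stream.
endpoint_a = [(1, "a"), (4, "d"), (7, "g")]
endpoint_b = [(2, "b"), (5, "e")]
endpoint_c = [(3, "c"), (6, "f")]

merged = list(merge_sorted_endpoints([endpoint_a, endpoint_b, endpoint_c]))
print(merged)
```

Because `heapq.merge` is lazy, it only needs to hold one pending row per stream, so the client does not have to buffer and re-sort the full result. That still leaves the original problem: a generic client behind Flight SQL or JDBC has no standard way to learn that the endpoints are sorted, or on which key.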