
Converting an Apache Arrow Table to a RecordBatch in C++


I would like to obtain a std::shared_ptr<arrow::RecordBatch> from a std::shared_ptr<arrow::Table>, like this:

std::shared_ptr<arrow::Table> table = ...
auto rb = arrow::RecordBatch::Make(table->schema(), table->num_rows(), table->columns());

However, the compiler complains that there is no known conversion from 'const vector<shared_ptr<arrow::ChunkedArray>>' to 'vector<shared_ptr<arrow::Array>>', since table->columns() returns a vector<shared_ptr<arrow::ChunkedArray>>. I can't seem to convert an arrow::ChunkedArray into an arrow::Array. I've pored over the documentation, but I can't for the life of me figure out how to do this.

How do I go about it? Alternatively, is there another way to convert an arrow::Table into an arrow::RecordBatch?


Solution

  • There is a helper method arrow::Table::CombineChunksToBatch which should become available in the 7.0.0 release.

    In the meantime you can do this:

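      // ARROW_ASSIGN_OR_RAISE must be used inside a function returning arrow::Status or arrow::Result.
      // CombineChunks also accepts an optional arrow::MemoryPool*.
      ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> combined, table->CombineChunks());
      std::vector<std::shared_ptr<arrow::Array>> arrays;
      arrays.reserve(combined->num_columns());
      for (const auto& column : combined->columns()) {
        // Each column is expected to hold a single chunk now (a column with no chunks would need separate handling).
        arrays.push_back(column->chunk(0));
      }
      std::shared_ptr<arrow::RecordBatch> batch =
          arrow::RecordBatch::Make(combined->schema(), combined->num_rows(), std::move(arrays));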
    

    Keep in mind that this is not a zero-copy operation. Each column in a table may consist of multiple arrays (chunks). When you call arrow::Table::CombineChunks, it has to allocate a new array big enough for all of a column's chunks and then copy the data from each chunk into that new array.

    If at all possible, it is generally more performant to either keep the table as it is or operate on it in a streaming fashion (e.g. use an arrow::TableBatchReader and operate on one batch at a time).
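To make the cost difference concrete, here is a minimal, library-free sketch of the two approaches. It deliberately does not use Arrow: `Chunk`, `ChunkedColumn`, `CombineChunksModel` and `SumStreaming` are hypothetical names introduced only for illustration, with `std::vector` standing in for an Arrow array and a vector of vectors standing in for an arrow::ChunkedArray. It only models the allocate-and-copy behaviour described above, not Arrow's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-ins, not Arrow types: a "chunk" is one contiguous buffer,
// and a "chunked column" is a list of such buffers.
using Chunk = std::vector<std::int64_t>;
using ChunkedColumn = std::vector<Chunk>;

// Models combining chunks: allocate one buffer large enough for every chunk,
// then copy each chunk into it. Every value in the column is copied once.
Chunk CombineChunksModel(const ChunkedColumn& column) {
  std::size_t total = 0;
  for (const Chunk& chunk : column) {
    total += chunk.size();
  }

  Chunk combined;
  combined.reserve(total);  // the single new allocation
  for (const Chunk& chunk : column) {
    combined.insert(combined.end(), chunk.begin(), chunk.end());  // the copy
  }
  return combined;
}

// Models streaming: visit each chunk where it already lives, so no combined
// buffer is allocated and no values are copied.
std::int64_t SumStreaming(const ChunkedColumn& column) {
  std::int64_t sum = 0;
  for (const Chunk& chunk : column) {
    sum = std::accumulate(chunk.begin(), chunk.end(), sum);
  }
  return sum;
}
```

In this model, `CombineChunksModel` touches and duplicates every value before any work can start, while `SumStreaming` reads each chunk in place. With Arrow itself, iterating with an arrow::TableBatchReader plays the role of the streaming loop.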