Search code examples
c++apache-arrowapache-arrow-cpp

How do you compute Grouped Aggregations in Apache Arrow in C++


In Apache Arrow, it seems to be possible to do queries that are similar to "group by" in SQL (see their documentation); however, there are not any examples of how to use this. I want to know how to go from an arrow::Table and for a given column be able to see the count for each distinct value in the column (I know I could just iterate over it manually). If this is the wrong way to do this, let me know, but I still think an example of how to do "group by" in C++ Arrow would be useful, as there are examples for python, but I could not find any examples of this for C++.


Solution

  • For the most flexibility you will want to make and execute a plan:

    arrow::compute::Aggregate aggregate;
    aggregate.function = "hash_sum";                             // The function to apply
    aggregate.name = "SUM OF VALUES";                            // The default name of the output column
    aggregate.options = nullptr;                                 // Custom options (e.g. how to handle null)
    aggregate.target = std::vector<arrow::FieldRef>({"values"}); // Which field to aggregate.  Some aggregate functions (e.g. covariance)
                                                                 // may require targetting multiple fields
    arrow::compute::Declaration plan = arrow::compute::Declaration::Sequence({
      {"table_source", arrow::compute::TableSourceNodeOptions(std::move(sample_table))},
      {"aggregate", arrow::compute::AggregateNodeOptions(/*aggregates=*/{aggregate}, /*keys=*/{"keys"})}
    });
    
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> grouped, arrow::compute::DeclarationToTable(std::move(plan)));
    

    However, if all you want to do is apply a group-by operation, there is also a convenience function:

    // aggregate is defined the same as above
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> grouped,
                          arrow::compute::TableGroupBy(std::move(sample_table), {std::move(aggregate)}, {"keys"}));
    

    Complete working example (tested on a fairly recent version of main): https://gist.github.com/westonpace/be500030cc268a626af60abb9299b9ae