In Apache Arrow, it seems to be possible to do queries that are similar to "group by" in SQL (see their documentation); however, there are not any examples of how to use this. I want to know how to go from an arrow::Table
and for a given column be able to see the count for each distinct value in the column (I know I could just iterate over it manually). If this is the wrong way to do this, let me know, but I still think an example of how to do "group by" in C++ Arrow would be useful, as there are examples for python, but I could not find any examples of this for C++.
For the most flexibility you will want to make and execute a plan:
arrow::compute::Aggregate aggregate;
aggregate.function = "hash_sum"; // The function to apply
aggregate.name = "SUM OF VALUES"; // The default name of the output column
aggregate.options = nullptr; // Custom options (e.g. how to handle null)
aggregate.target = std::vector<arrow::FieldRef>({"values"}); // Which field to aggregate. Some aggregate functions (e.g. covariance)
// may require targetting multiple fields
arrow::compute::Declaration plan = arrow::compute::Declaration::Sequence({
{"table_source", arrow::compute::TableSourceNodeOptions(std::move(sample_table))},
{"aggregate", arrow::compute::AggregateNodeOptions(/*aggregates=*/{aggregate}, /*keys=*/{"keys"})}
});
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> grouped, arrow::compute::DeclarationToTable(std::move(plan)));
However, if all you want to do is apply a group-by operation, there is also a convenience function:
// aggregate is defined the same as above
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> grouped,
arrow::compute::TableGroupBy(std::move(sample_table), {std::move(aggregate)}, {"keys"}));
Complete working example (tested on a fairly recent version of main): https://gist.github.com/westonpace/be500030cc268a626af60abb9299b9ae