Using Accumulo's iterators and combiners to aggregate values from multiple rows


I would like to know whether it is possible to perform aggregation operations on values stored in multiple rows. For example, I have the following table:

rowID   colFam   colQual   value
00000   0000     A         12
00000   0001     B         Test
00001   0000     A         35
00001   0001     B         Foo
00002   0000     A         7
00002   0001     B         Bar

What I am trying to do is find the average of all values stored with columnQualifier A. Is it possible using Accumulo's Iterators, Filters or Combiners?

I saw the StatsCombiner, but that combiner aggregates different versions of the same key (rowID, colFam and colQual are the same but the timestamp is different) instead of aggregating across distinct keys.


Solution

  • Combiners (and their predecessors, Aggregators) aggregate values for the same key. You can create an iterator which transforms multiple keys into a single key, but you'll still have to do the final aggregation in the client, because each tablet produces its own partial computation.

    You could use Apache Fluo's "observers" to aggregate your stats while you ingest into your table.

    There are probably multiple solutions. I would suggest taking a look at Apache Fluo, and if you really don't want to use that, then consider aggregating partial sums/counts in an iterator on each tablet, and doing the final aggregation on the client side.
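
To make the partial sums/counts idea concrete, here is a rough sketch in plain Java. It does not use the Accumulo API at all: `scanTablet` only stands in for what a custom server-side iterator might compute per tablet, and `average` stands in for the client-side merge. The class and method names, and the way the example rows are split across two tablets, are assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of "partial sum/count per tablet, final merge on the client".
// Nothing here is Accumulo API; all names are hypothetical.
final class PartialAverage {

    /** One table entry, mirroring the columns of the example table in the question. */
    static final class Entry {
        final String row, colFam, colQual, value;
        Entry(String row, String colFam, String colQual, String value) {
            this.row = row; this.colFam = colFam; this.colQual = colQual; this.value = value;
        }
    }

    /** Partial result a tablet would emit: a sum and a count, not an average. */
    static final class Partial {
        final long sum, count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
        Partial merge(Partial other) { return new Partial(sum + other.sum, count + other.count); }
    }

    /** Stand-in for the per-tablet iterator: folds matching entries into one Partial. */
    static Partial scanTablet(List<Entry> tablet, String qualifier) {
        long sum = 0, count = 0;
        for (Entry e : tablet) {
            if (e.colQual.equals(qualifier)) {
                sum += Long.parseLong(e.value);
                count++;
            }
        }
        return new Partial(sum, count);
    }

    /** Client-side step: merge every tablet's Partial, then divide once. */
    static double average(List<Partial> partials) {
        Partial total = new Partial(0, 0);
        for (Partial p : partials) {
            total = total.merge(p);
        }
        if (total.count == 0) {
            throw new IllegalArgumentException("no matching entries");
        }
        return (double) total.sum / total.count;
    }

    /** Runs the example table, assuming rows 00000-00001 and row 00002 sit on different tablets. */
    static double demoAverage() {
        List<Entry> tablet1 = Arrays.asList(
            new Entry("00000", "0000", "A", "12"),
            new Entry("00000", "0001", "B", "Test"),
            new Entry("00001", "0000", "A", "35"),
            new Entry("00001", "0001", "B", "Foo"));
        List<Entry> tablet2 = Arrays.asList(
            new Entry("00002", "0000", "A", "7"),
            new Entry("00002", "0001", "B", "Bar"));
        return average(Arrays.asList(scanTablet(tablet1, "A"), scanTablet(tablet2, "A")));
    }

    public static void main(String[] args) {
        System.out.println(demoAverage()); // prints 18.0
    }
}
```

Each tablet returns a (sum, count) pair rather than its own average, because averaging the per-tablet averages gives the wrong answer when tablets hold different numbers of matching entries: with the split assumed above it would be (23.5 + 7) / 2 = 15.25 instead of 54 / 3 = 18.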