I thought I had it with rethinkdb :) but now I'm a bit confused - for this query, counting grouped data:
groupedRql.count()
I'm getting the expected results (numbers):
[{"group": "a", "reduction": 41}, {"group": "b", "reduction": 39}...]
all reduction results are ~40 which is expected (and correct), but when I count using reduce like this:
groupedRql.map(function(row) {
  return row.merge({
    count: 0
  })
}).reduce(function(left, right) {
  return {count: left("count").add(1)}
})
I'm getting much lower results (~10) which MAKE NO SENSE:
[{"group": "a", "reduction": 10}, {"group": "b", "reduction": 9}...]
I need to use reduce, of course, for further manipulation. Am I missing something?
I'm using v2.0.3 on the server; the queries were tested directly in the Data Explorer.
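The problem lies here:
return {count: left("count").add(1)}
It should be:
return {count: left("count").add(right("count"))}
For that sum to count rows, the map step also has to emit count: 1 for each row instead of count: 0; otherwise every partial result stays at 0.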
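The reduce runs in parallel across multiple shards and multiple CPU cores. When you do
return {count: left("count").add(1)}
you throw away whatever count has already been accumulated on the right-hand side, because right can itself be the output of an earlier call to the reduce function.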
It's noted in this document: https://www.rethinkdb.com/docs/map-reduce/#how-gmr-queries-are-executed
it’s important to keep in mind that the reduce function is not called on the elements of its input stream from left to right. It’s called on either the elements of the stream in any order or on the output of previous calls to the function.
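To make the effect concrete, here is a small plain-JavaScript simulation (no RethinkDB connection involved). It is only a sketch: the helper chunkedReduce is hypothetical, and the round-robin split into 4 chunks is an assumption standing in for shards/cores, not how the server actually partitions a group. Each chunk is reduced on its own, and then the partial results are reduced together with the same function.

```javascript
// Hypothetical helper: split rows into chunks, reduce each chunk separately,
// then reduce the partial results with the same reducer function.
function chunkedReduce(rows, chunkCount, reducer) {
  const chunks = Array.from({ length: chunkCount }, () => []);
  rows.forEach((row, i) => chunks[i % chunkCount].push(row));
  const partials = chunks
    .filter((chunk) => chunk.length > 0)
    .map((chunk) => chunk.reduce(reducer));
  return partials.reduce(reducer);
}

// 40 rows in one group, mapped the way the question does (count: 0)
// and the way a counting reduce needs (count: 1).
const rowsWithZero = Array.from({ length: 40 }, () => ({ count: 0 }));
const rowsWithOne = Array.from({ length: 40 }, () => ({ count: 1 }));

// Reducer from the question: ignores whatever is on the right.
const addOne = (left, right) => ({ count: left.count + 1 });
// Corrected reducer: combines both sides.
const addRight = (left, right) => ({ count: left.count + right.count });

const serialOriginal = chunkedReduce(rowsWithZero, 1, addOne).count;
const parallelOriginal = chunkedReduce(rowsWithZero, 4, addOne).count;
const parallelFixed = chunkedReduce(rowsWithOne, 4, addRight).count;

console.log(serialOriginal, parallelOriginal, parallelFixed); // 39 12 40
```

With a single chunk the original reducer lands close to the real count, which is why it looks plausible at first. Once the work is split, each merge of two partial results adds only 1 and discards the right-hand partial count, so the total collapses to roughly the size of one chunk. The corrected version gives 40 no matter how the rows are split.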