I am processing a stream of text data where I don't know ahead of time what the distribution of its values is, but I know each item looks like this:
{
    "datetime": "1986-11-03T08:30:00-07:00",
    "word": "wordA",
    "value": "someValue"
}
I'm trying to bucket it into RethinkDB objects based on its value, where the objects look like the following:
{
    "bucketId": "1",
    "bucketValues": {
        "wordA": [
            {"datetime": "1986-11-03T08:30:00-07:00"},
            {"datetime": "1986-11-03T08:30:00-07:00"}
        ],
        "wordB": [
            {"datetime": "1986-11-03T08:30:00-07:00"},
            {"datetime": "1986-11-03T08:30:00-07:00"}
        ]
    }
}
The purpose is to eventually count the number of occurrences for each word in each bucket.
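For example, for the bucket above the result I'd eventually want would look something like:
{
    "wordA": 2,
    "wordB": 2
}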
Since I'm dealing with about a million buckets, and have no knowledge of the words ahead of time, the plan is to create these objects on the fly. I am new to RethinkDB, however, and while I've tried to write this so that I don't attempt to add a word key to a bucket that doesn't exist yet, I'm not entirely sure whether chaining the commands as follows is best practice (note that I am running this on a Node.js server):
var bucketId = "someId";
var word = "someWordValue";

r.do(r.table("buckets").get(bucketId), function(result) {
    return r.branch(
        // If the bucket doesn't exist
        result.eq(null),
        // Create it
        r.table("buckets").insert({
            "id": bucketId,
            "bucketValues": {}
        }),
        // Else do nothing
        "Bucket already exists"
    );
})
.run()
.then(function(result) {
    console.log(result);
    r.table("buckets").get(bucketId)
        .do(function(bucket) {
            return r.branch(
                // If the word already exists
                bucket("bucketValues").keys().contains(word),
                // Just append to it (code not implemented yet)
                "Word already exists",
                // Else create the word and append it
                r.table("buckets").get(bucketId).update(
                    {"bucketValues": r.object(word, [/*Put the timestamp here*/])}
                )
            );
        })
        .run()
        .then(function(result) {
            console.log(result);
        });
});
Do I need to execute run here twice, or am I way off base on how you're supposed to properly chain things together with RethinkDB? I just want to make sure I'm not doing this the wrong/hard way before I get much deeper into this.
You don't have to execute run() multiple times; it depends on what you want. Basically, run() ends the chain and sends the query to the server. So we do all the work of building up the query and finish with run() to execute it. If you call run() twice, that means two queries are sent to the server.
So if we can do all of the processing with RethinkDB functions, we only need to call run() once. However, if we want to do some post-processing of the data on the client side, then we have no choice. I usually try to do all of the processing in RethinkDB: with control structures, looping, and anonymous functions we can go pretty far without pushing any logic to the client.
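For example, your first query (check whether the bucket exists, create it if not) already fits in a single run(). A minimal sketch with the official driver, assuming conn is an open connection obtained from r.connect():

r.branch(
    // If the bucket doesn't exist, create it; otherwise do nothing
    r.table("buckets").get(bucketId).eq(null),
    r.table("buckets").insert({"id": bucketId, "bucketValues": {}}),
    "Bucket already exists"
)
.run(conn)
.then(function(result) {
    console.log(result);
});

The rewrite below goes one step further and avoids the branches entirely by letting the duplicate insert fail silently and using default() for the missing word key.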
In your case, the query can be rewritten in Node.js using the official driver:
var r = require('rethinkdb');

var bucketId = "someId2";
var word = "someWordValue2";

r.connect()
    .then((conn) => {
        return r.table("buckets").insert({
            "id": bucketId,
            "bucketValues": {}
        })
        .do((result) => {
            // We don't care about the insert result at all,
            // we just want to ensure the bucket exists
            return r.table("buckets").get(bucketId)
                .update(function(bucket) {
                    return {
                        "bucketValues": r.object(
                            word,
                            bucket("bucketValues")(word).default([])
                                .append(r.now())
                        )
                    };
                });
        })
        .run(conn)
        .then((result) => { conn.close(); });
    });
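Once the buckets are populated, the per-word counts you're after can also be computed server-side in one query. A minimal sketch, again assuming conn is an open connection; coerceTo is used to turn the bucketValues object into key/value pairs and back:

r.table("buckets").get(bucketId)("bucketValues")
    .coerceTo("array")                     // [[word, [datetime, ...]], ...]
    .map(function(pair) {
        return [pair(0), pair(1).count()]; // [word, number of occurrences]
    })
    .coerceTo("object")
    .run(conn)
    .then(function(counts) {
        console.log(counts);               // e.g. {"wordA": 2, "wordB": 2}
        conn.close();
    });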