MongoDB BI Architecture

We have a production app running Mongo with a replica set on a different box.

I'd like to start doing some BI on the data, possibly using Pentaho.

My question is: how should I set up my architecture so that I'm not doing BI directly on the production environment?

Should I create a separate BI instance and do a mongoexport to the BI instance, or is there some other best practice I should follow?

Solution

There are several options to consider depending on your data set, BI requirements, and MongoDB server version. If you just need to read data for reports, there are more options than if you want to write data as well (e.g. for a map/reduce operation). MongoDB 2.2 also introduces some very useful features and improvements, as noted below.

In general, use of a replica set configuration will be extremely helpful for administrative purposes as this enables a full copy of your data set to be available without disrupting your primary MongoDB server. For larger data sets and horizontal write scaling, MongoDB's sharding feature can also be used in conjunction with any of the suggestions below.

Before going down the path of separating your BI data, it would be worth measuring the actual impact of BI queries by testing in a staging environment.

The following approaches are roughly in order of isolation from your production environment:

With replica sets you can use read preferences to direct queries to appropriate servers. In versions of MongoDB prior to 2.2 the general read preferences were limited to reading from a primary or allowing reads from non-hidden secondaries with "slaveOk" (equivalent to "secondaryPreferred"). In MongoDB 2.2 there are some additional read preferences including "secondary" (read from a secondary if available, otherwise error); "primaryPreferred" (read from the primary if available, otherwise a secondary); and "nearest" (read from the nearest primary or secondary node based on network latency). The read preferences in MongoDB 2.2 can be used in conjunction with tag sets to provide even more granular control over directing queries to servers in a replica set or sharded cluster.

For MongoDB 1.8 and higher, you can use replica sets with a hidden secondary node. A hidden node will not be advertised to clients connecting to the replica set normally, but can be connected to directly for report generation. Note: the hidden node will be a read-only secondary, so this limits the use of some queries. For example, map/reduce requires write access to save to an output collection, but you could use an inline map/reduce depending on your BI requirements.

MongoDB 2.2 has a database-level write lock (an improvement from prior versions that had a global write lock). If you need to write BI data, you can save this into a separate database to minimize lock contention. You still have to consider the overall resource effect. For example, processing a significant number of older documents for BI purposes might compete with the caching of the most recent documents that are being queried by your production application.

If you want to completely separate BI data from the production environment, you could create a separate instance using one of the MongoDB backup strategies. If you have replication enabled, you can create a backup from a secondary in your replica set. Depending on the size of your data set, it will likely be faster to do a snapshot copy of the data (which includes indexes that are already built) rather than a full mongodump/mongorestore cycle.
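
As a rough illustration of the first approach (read preferences with tag sets), here is a minimal Python sketch that builds a connection string for a BI client. The host names, the replica set name "rs0", the database name "app", and the tag "use:reporting" are all made up for the example; the tag only has an effect if you have assigned it to a secondary in your replica set configuration. It assumes a driver that understands the standard "readPreference" and "readPreferenceTags" connection string options.

```python
from urllib.parse import urlencode


def bi_connection_uri(hosts, replica_set, database, tags=None):
    """Build a MongoDB URI whose reads go to secondaries only.

    hosts, replica_set, database and tags are placeholders for this
    sketch; substitute the values from your own deployment.
    """
    params = [
        ("replicaSet", replica_set),
        # "secondary": read from a secondary if available, otherwise error.
        ("readPreference", "secondary"),
    ]
    if tags:
        # A tag set is written as comma-separated key:value pairs.
        tag_set = ",".join(f"{key}:{value}" for key, value in tags.items())
        params.append(("readPreferenceTags", tag_set))
    # Keep ':' and ',' unescaped so the tag set stays readable in the URI.
    query = urlencode(params, safe=":,")
    return f"mongodb://{','.join(hosts)}/{database}?{query}"


if __name__ == "__main__":
    uri = bi_connection_uri(
        ["db1.example.com:27017", "db2.example.com:27017"],  # hypothetical hosts
        replica_set="rs0",          # hypothetical replica set name
        database="app",             # hypothetical database name
        tags={"use": "reporting"},  # hypothetical tag on the BI secondary
    )
    print(uri)
    # With a driver such as PyMongo installed, the URI could then be
    # passed to the client, e.g. pymongo.MongoClient(uri).
```

Keeping the read preference in the connection string means the BI tool's configuration is the only place that decides where its queries run, and the production application's own connection settings stay untouched.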