Search code examples
mongodbsharding

Mongodb: Adding new shards to a cluster added load to the cluster


The circumstance:

Previously, I had three machines:10.10.10.5, 10.10.10.6, and 10.10.10.7

10.10.10.5 runs:

  • Config Database
  • mongoS
  • shard3, shard4 mongod processes (these are primary in their shards)

10.10.10.6 runs:

  • mongoS
  • shard3 shard4 mongod processes (these are secondary in their shards)

10.10.10.7 runs:

  • mongoS
  • shard3, shard4 mongod processes (these are arbiters)

My application connects to the 10.10.10.6 mongoS.

Everything was running well for about one year. Then, 10.5 and 10.6 experienced very heavy load, especially 10.6. The cpu usage and load average were very high, so I planned to add two new machines to the cluster.

I created two shards: shard1 and shard2. The new machine 10.10.10.8 runs:

  • shard1(primary), shard2(secondary)
  • mongoS

The new machine 10.10.10.9 runs:

  • shard1(secondary), shard2(primary)
  • mongoS

To the old member 10.10.10.7 I also added shard1, shard2 arbiters.

The problem is that when I added the two new machines (using the addShard command), about 5 hours later they finished the migration (though I can't make sure), then the 10.10.10.6 host once again had extremely high load, the load average about 90.5 (4 cpus).

Meanwhile there are many writes and reads request from application to 10.10.10.6 mongoS, but rarely any data or no data written to the new two machines. I used iostat to find there were almost no io bytes in the two new machines.

Why is 10.10.10.6 so highly loaded?

Previously even in peak time the highest load was about 30.5

So could you guys kindly advise how to fix the load issues and get the new machines up and running?

Edit: More information about my environment

10.5, 10.6, 10.7, 10.8, 10.9 all have the same resouce:4CPUS,6g Mem,150G diskspace,the netio is optical fiber.

Shard3 datasize=16g and Shard4 datasize 15g.

I am using 1.8.2


Solution

  • Edit: After discussion in chat

    It is expected that there will be some overhead when adding new shards, at least initially. This is because chunk migrations need to take place, and these will use CPU, disk and network I/O. This will add some additional load to your environment.

    If your read preference is set to read from secondaries, the 10.6 server could quickly become overloaded as it tries to keep up with the replication for two replica sets (which will be increased due to chunk migration) and the traffic from the application itself. Potentially this could be reduced by adding more secondaries, but you need to test this in an environment that closely mimics your production environment.

    Adding more shards could potentially help as well, but again you need to test this thoroughly. It would appear that when you added shards previously the chunk migrations did not complete, so the new shards did not help out with the load as much as they should have done. If you are to add shards again in future, make sure that the chunks have finished migrating by checking db.getSiblingDB("config").locks.find({"_id":'balancer'}) and the output of db.printShardingStatus() to see that the number of chunks is equal across all shards.

    Some more general notes:

    • In production it is inadvisable to have only an single config server running. If you lose this single config server, the cluster will become unusable. See more details here and here

    • Generally speaking it is not recommended to run two mongod instances on the same machine. Two processes will compete for resources that they share, and this is especially true when using memory mapped files as MongoDB does.

    • You can determine what queries and processes are causing the most load by using some built in tools. mongostat and mongotop Edit: MongoTop is not available in 1.8.2 are two command line utilities that allow you to track usage of mongodb. Inside the console you can also run db.currentOp() to get more information on the current operation. You can tell what the balancer is doing (whether or not it is currently doing a balancer round) by issuing db.getSiblingDB("config").locks.find({"_id":'balancer'}) from the console.

    • You are running on a very old version of MongoDB. You should plan to update, if not to the latest stable (2.2.0) or most recent (2.0.7) then to the last stable of the branch you are on (1.8.5). There have been numerous fixes and improvements made to the product since the release you are currently using, which will provide many benefits.