I am running a sharded MongoDB environment: 3 mongod shards, 1 mongod config server, and 1 mongos (no replication).
I want to use mongoimport to import CSV data into the database. I have 105 million records stored in increments of 500,000 across 210 CSV files. I understand that mongoimport is single-threaded, and I have read that running multiple mongoimport processes should give better performance. However, I tried that and didn't get a speedup:
when running 3 mongoimports in parallel, I got ~6k inserts/sec per process (~18k inserts/sec total), whereas running 1 mongoimport gave me ~20k inserts/sec.
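For reference, this is roughly how I launched the parallel imports (the host, database, collection, and file names are placeholders for my actual values):

```bash
# Run 3 imports concurrently against the mongos, one CSV file each.
for f in records_001.csv records_002.csv records_003.csv; do
  mongoimport --host localhost:27017 \
              --db mydb --collection records \
              --type csv --headerline --file "$f" &
done
wait   # block until all background imports finish
```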
Since these processes were routed through the single mongod config server and the single mongos, I am wondering if this is due to my cluster configuration. If I set up my cluster differently, will I achieve better mongoimport speeds? Do I want more mongos processes? How many mongoimport processes should I fire off at a time?
So, the first thing you need to do is "pre-split" your chunks.
Let's assume that you have already sharded the collection to which you're importing. When you start "from scratch", all of the data will start going to a single node (shard). As that node fills up, MongoDB will start "splitting" its data into chunks. Once it gets to around 8 chunks (at the default chunk size of 64 MB, that's roughly 512 MB of data), it will start migrating chunks to the other shards.
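For concreteness, here is a minimal sketch of sharding the target collection, using hypothetical names (database `mydb`, collection `records`, and a numeric shard key `seq`); run it against the mongos:

```bash
# Enable sharding on the database and shard the collection on "seq".
# "mydb", "records", and "seq" are placeholder names.
mongo admin --eval '
  db.adminCommand({ enablesharding: "mydb" });
  db.adminCommand({ shardcollection: "mydb.records", key: { seq: 1 } });
  db.printShardingStatus();  // shows which shard owns which chunk ranges
'
```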
So you're effectively writing to a single node, and that node is slowed down even further because it has to read its data back and write it out to the other nodes as those migrations happen.
This is why you're not seeing any speedup with 3 mongoimport processes: all of the data is still going to a single node, and you're maxing out that node's throughput.
The trick here is to "pre-split" the data. In your case, you would probably set it up so that each shard ends up owning about 70 files' worth of data. Then you can import those files in parallel and get much better aggregate throughput; a sketch follows below.
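A minimal pre-split sketch, again with hypothetical names: 105M records keyed by a numeric `seq` running from 0 to 105,000,000, split into thirds at 35M and 70M so each of the 3 shards owns one range. The shard names below (`shard0000` and so on) are the defaults for standalone-mongod shards; check `db.printShardingStatus()` for yours. Disabling the balancer keeps the chunks where you put them during the load:

```bash
# Pre-split the empty collection and place one range on each shard.
mongo admin --eval '
  sh.setBalancerState(false);  // shell helper; stops chunk migrations
  db.adminCommand({ split: "mydb.records", middle: { seq: 35000000 } });
  db.adminCommand({ split: "mydb.records", middle: { seq: 70000000 } });
  db.adminCommand({ moveChunk: "mydb.records", find: { seq: 0 },        to: "shard0000" });
  db.adminCommand({ moveChunk: "mydb.records", find: { seq: 35000000 }, to: "shard0001" });
  db.adminCommand({ moveChunk: "mydb.records", find: { seq: 70000000 }, to: "shard0002" });
'
```

With the chunks placed ahead of time, each parallel mongoimport's writes land on a different shard (as long as each process imports files from a different key range), so the processes stop contending for a single node.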
Jeremy Zawodny of Craigslist has a reasonable write-up on this, and the MongoDB site has documentation on pre-splitting chunks.