Tags: mongodb, pymongo

Index a collection in MongoDB having an array of objects for fast update queries


I have a document which looks like this:

{
  name: "xx",
  surname: "yy",
  job_history: [
    { company: "aa", salary: "bb", experience: "cc", ip_address: "xx.xx.xx.xx" },
    { company: "aa", salary: "bb", experience: "cc", ip_address: "xx.xx.xx.xx" }
  ]
}

The collection is 13.5GB and I need to update the ip_address values. I have all the IP addresses (original and replacement values) in an Excel sheet. I tried indexing job_history.ip_address, but somehow the index is not used by the update query. A single update takes a long time to perform (around 2 minutes). I have also tried bulk_write() and it is slow as well. Any suggestions on how to perform an efficient update on such a collection, or how to index it for efficient updates?

To give more information on the dataset: there are 1.5 million documents in the collection, the structure of each document is exactly the same as above, and the job_history array has either 1 or 2 objects of exactly the same structure.

My question is: how can I speed up this update? Indexing which fields will actually speed up the update?
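For context, a minimal pymongo sketch of the kind of per-IP update described above (database/collection names and IP values are placeholders):

  from pymongo import MongoClient

  coll = MongoClient()["mydb"]["collection"]   # placeholder connection and names
  old_ip, new_ip = "10.0.0.1", "192.168.1.1"   # one pair from the Excel sheet

  # update the first matching job_history element in every document containing old_ip
  coll.update_many(
      {"job_history.ip_address": old_ip},
      {"$set": {"job_history.$.ip_address": new_ip}},
  )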


Solution

  • 13.5GB is not a big collection for MongoDB, so this could be a storage or resource issue. A few questions that may help:

    1. How many replica set members do you have, and are they located nearby or deployed in different geographical areas?

    2. What writeConcern do you use for the update query?

    3. How much RAM & CPU does each member have, and do the indexes fit in memory?

    4. Is the database under heavy utilisation (how many reads/writes per second)?

    5. What is the MongoDB version and storage engine?

    Follow these steps:

    1. mongoimport your csv/xls into a new temporary collection "ipchange" with the following document schema:

      { _id: 1, ipold: "x", ipnew: "y" }
      

      Example mongoimport command:

       mongoimport -d <yourDB> -c ipchange --type csv --file ipchange.csv --headerline
      

      example ipchange.csv file:

        ipold,ipnew
        19.16.1.2,192.168.1.2
        1.2.3.4,192.168.1.3
      
    2. Create an index on the target collection's "job_history.ip_address" field:

      db.collection.createIndex({ "job_history.ip_address": 1 })
      
    3. Execute the following script in the mongo shell:

       db.ipchange.find({}).forEach(function(doc) {
         db.collection.update(
           { "job_history.ip_address": doc.ipold },
           { $set: { "job_history.$[x].ip_address": doc.ipnew } },
           {
             arrayFilters: [ { "x.ip_address": doc.ipold } ],
             multi: true,
             writeConcern: { w: 0, j: false, wtimeout: 30000 }   // 30sec max
           }
         );
       });

    Please note that w: 0 and journaling j: false are set here for speed; if you need write confirmation you can change w to "majority", of course sacrificing some speed.

    4. When you are sure the operation executed successfully, you can drop the temporary collection with:

       db.ipchange.drop()
      

    Explained:

    1. Load the xls/csv file into a temporary collection.
    2. Create an index so the old IPs can be identified faster.
    3. Execute forEach on the ipchange collection and update your target collection with the necessary changes. Please note that multi: true updates every document in which the old IP is found with the new IP from the list.
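    Since the question is tagged pymongo, here is a rough Python equivalent of steps 2-3 — a sketch only, with a placeholder connection and database name ("mydb"); the default acknowledged write concern is used here, since pymongo may reject arrayFilters on unacknowledged (w: 0) writes:

      from pymongo import MongoClient

      client = MongoClient()                      # placeholder connection
      db = client["mydb"]                         # placeholder database name
      target = db["collection"]                   # the collection being updated

      # step 2: index the array field used by the update filter
      target.create_index([("job_history.ip_address", 1)])

      # step 3: one update_many per (ipold, ipnew) pair from the temporary collection
      for doc in db["ipchange"].find({}):
          target.update_many(
              {"job_history.ip_address": doc["ipold"]},
              {"$set": {"job_history.$[x].ip_address": doc["ipnew"]}},
              array_filters=[{"x.ip_address": doc["ipold"]}],
          )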

    One more thing: you can confirm whether the index is used via explain:

         db.collection.explain("executionStats").update(...)
    

    In the output you will see something like:

                            "inputStage" : {
                                "stage" : "IXSCAN"
    

    instead of:

                        "inputStage" : {
                            "stage" : "COLLSCAN"
    

    update query playground
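    If you drive the check from pymongo rather than the shell, the same executionStats output is available through the explain database command — a rough sketch, assuming the placeholder names used above (key order in the command document matters; plain dicts preserve it on Python 3.7+):

      from pymongo import MongoClient

      db = MongoClient()["mydb"]                  # placeholder connection/database

      # explain the update at "executionStats" verbosity and look for IXSCAN vs COLLSCAN
      plan = db.command({
          "explain": {
              "update": "collection",             # target collection name
              "updates": [{
                  "q": {"job_history.ip_address": "1.2.3.4"},
                  "u": {"$set": {"job_history.$[x].ip_address": "192.168.1.3"}},
                  "arrayFilters": [{"x.ip_address": "1.2.3.4"}],
                  "multi": True,
              }],
          },
          "verbosity": "executionStats",
      })
      print(plan["queryPlanner"]["winningPlan"])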

    Summary: Of course you can add some parallelisation with bulk_write / ordered:false in a multi-threaded script and split the input data into ranges per bulk (see the sketch below), but the important point is that the index is used by the filter part of your update query, so the individual updates are fast.
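    For the bulk_write route, a rough pymongo sketch (placeholder names; it reads ipchange.csv directly, so the temporary collection is optional) that batches UpdateMany operations and submits them with ordered=False:

      import csv
      from pymongo import MongoClient, UpdateMany

      coll = MongoClient()["mydb"]["collection"]  # placeholder connection/names
      BATCH = 1000                                # batch size to tune

      ops = []
      with open("ipchange.csv", newline="") as f:
          for row in csv.DictReader(f):           # columns: ipold,ipnew
              ops.append(UpdateMany(
                  {"job_history.ip_address": row["ipold"]},
                  {"$set": {"job_history.$[x].ip_address": row["ipnew"]}},
                  array_filters=[{"x.ip_address": row["ipold"]}],
              ))
              if len(ops) >= BATCH:
                  coll.bulk_write(ops, ordered=False)   # unordered for speed
                  ops = []
      if ops:
          coll.bulk_write(ops, ordered=False)

    Either way, the index on job_history.ip_address is what keeps each individual filter fast.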