Search code examples
mongodbmongodb-queryaggregation-framework

Converting csv (";" based) string value of a field to new key:value items in same document in Mongodb aggregate


Sample documents of a collection

DB: MYDB
Collection: MYCOLL
{
    "_id": {
      "$oid": "678a78375cb6955814197272"
    },
    "ID": 1019397,
    "INFO": "A=5242;AF=987;C=82622"
}
{
    "_id": {
      "$oid": "678a78375cb6955814197272"
    },
    "ID": 1019397,
    "INFO": "A=h242;AF=9y87;C=8w622"
}

I would like to have expected output as following:

{
    "_id": {
        "$oid": "678a78375cb6955814197272"
      },
      "ID": 1019397,
      "INFO": "A=5242;AF=987;C=82622",
      "A":"5242",
      "AF":"987",
      "C":"82622"

}

I have tried to do it myself, but for now I created the following aggregate code for a specific value in CSV. which is not what I want to do. Can you please tell me how can I make it better in performance and create (updating the current collection document) in the form as I showed above?

db["MYCOLL"].aggregate([
    
    {
        $project: {
            INFO_ARR: {
                $split:["$INFO",";"]
            }
        }
    },
    { 
        $project: { 
            AF_ARR: { 
                $arrayElemAt: [ "$INFO_ARR" ,  1 ] }
            }
    },
    { 
        $project: { 
            AF_FREQ_ARR: { 
                $split: [ "$AF_ARR" ,  "=" ]}
            }
            
    },
    {
        $project: { 
            AF: { 
                $arrayElemAt: [ "$AF_FREQ_ARR" ,  1 ] }
            }
    },
    { $merge : { into: { db: "MYDB", coll: "MYCOLL" }, on: "_id",  whenMatched: "merge", whenNotMatched: "insert" } }
          
])

Note: I can not use javascript as the collection is of size 400GB and is having millions of records. Hence, Javascript may be too slow to update all of the records. Am I wrong about this ?


Solution

  • You can use the following aggregation pipeline to split the contents of the INFO property and add the values as properties:

    [
      {
        $project: {
          INFO: 1
        }
      },
      {
        $set: {
          info_arr: {
            "$split": [
              "$INFO",
              ";"
            ]
          }
        }
      },
      {
        $set: {
          info_arr2: {
            "$map": {
              "input": "$info_arr",
              "in": {
                k: {
                  $first: {
                    $split: [
                      "$$this",
                      "="
                    ]
                  }
                },
                v: {
                  $last: {
                    $split: [
                      "$$this",
                      "="
                    ]
                  }
                }
              }
            }
          }
        }
      },
      {
        $set: {
          info_obj: {
            "$arrayToObject": "$info_arr2"
          }
        }
      },
      {
        "$replaceRoot": {
          "newRoot": {
            "$mergeObjects": [
              "$$ROOT",
              "$info_obj"
            ]
          }
        }
      },
      {
        $unset: [
          "info_arr",
          "info_arr2",
          "info_obj"
        ]
      }
    ]
    
    • The pipeline first splits the field content at the semicolons, then splits the individual lines at the equality sign and assigns the key to a k field, the value to a v field.
    • At this point, the $arrayToObject operator is used to create an object from the array contents.
    • Afterwards, the properties of the newly created subdocument are merged with the $$ROOT document.
    • At the end, the temporary properties are removed using an $unset stage.

    At the end, the documents look like this:

    {
        "A": "5242",
        "AF": "987",
        "C": "82622",
        "INFO": "A=5242;AF=987;C=82622",
        "_id": ObjectId("5a934e000102030405000001")
     }
    

    If you add a $merge stage, you can add the new properties to existing documents in the collection.

    See this playground to test.