python mysql mongodb pymongo data-migration

MySQL to MongoDB Data migration

We know that MongoDB has two ways of modeling relationships between relations/entities, namely, embedding and referencing (see difference here). Let's say we have a USER database with two tables in mySQL named user and address. An embedded MongoDB document might look like this:

{
  "_id": 1,
  "name": "Ashley Peacock",
  "addresses": [
    {
      "address_line_1": "10 Downing Street",
      "address_line_2": "Westminster",
      "city": "London",
      "postal_code": "SW1A 2AA"
    },
    {
      "address_line_1": "221B Baker Street",
      "address_line_2": "Marylebone",
      "city": "London",
      "postal_code": "NW1 6XE"
    }
  ]
}

Whereas in a referenced relation, 2 SQL tables will make 2 collections in MongoDB which can be migrated by this apporoach using pymongo.

How can we directly migrate MySQL data as an embedded document using python?

Insights about about Pseudo code and performance of algorithm will be highly useful. Something that comes to my mind is creating views by performing joins in MySQL. But in that case we will not be having the structure of children document inside a parent document.

Solution

Denormalization

First, for canonical reference, the question of "embedded" vs. "referenced" data is called denormalization. Mongo has a guide describing when you should denormalize. Knowing when and how to denormalize is a very common hang-up when moving from SQL to NoSQL and getting it wrong can erase any performance benefits you might be looking for. I'll assume you've got this figured out since you seem set on using an embedded approach.

MySQL to Mongo

Mongo has a great Python tutorial you may want to refer to. First join your user and address tables. It will look something like this:

| _id    | name           | address_line_1        | address_line_2 | ... 
| 1      | Ashley Peacock | 10 Downing Street ... | ...
| 1      | Ashley Peacock | 221B Baker Street ... | ...
| 2      | Bob Jensen     | 343 Main Street ...   | ...
| 2      | Bob Jensen     | 1223 Some Ave ...     | ...
...

Then iterate over the rows to create your documents and pass them to pymongo insert_one. Using upsert=True with insert_one will insert a new document if a matching one is not found in the database, or update an existing document if it is found. Using $push appends the address data to the array field addresses in the document. With this setup, insert_one will automatically handle duplicates and append addresses based on matching _id fields. See the docs for more details:

from pymongo import MongoClient

client = MongoClient(port=27017)
db = client.my_db

sql_data = []  # should have your SQL table data
# depending on how you got this into python, you will index with a  
# field name or a number, e.g. row["id"] or row[0] 

for row in sql_data:
    address = {
        "address_line_1": row["address_line_1"],
        "address_line_2": row["address_line_2"],
        "city": row["city"],
        "postal_code": row["postal_code"],
    }
    db.users.update_one(
        {"_id": row["_id"]},
        {"name": row["name"], "$push": {"addresses": address}},
        upsert=True,
    )