Tags: node.js, mongodb, express, mongoose, mongoose-schema

Best practices for structuring hierarchical/classified data in MongoDB


Summary:

I am building my first large-scale full-stack application (MERN stack), which tries to mimic a large clothing store. Each article of clothing has many 'tags' that represent its features: a top-level category (top/bottom/accessory/shoes/etc.), subcategories (for example, under top there is shirt/outerwear/sweatshirt/etc.), and sub-subcategories within those (for example, under shirt there is blouse/t-shirt/etc.). Each article also has tags for primary colors, hemline, pockets, technical features; the list goes on.

Main question:

How should I best organize the data in MongoDB with Mongoose schemas so that it is quickly searchable when I plan on having 50,000 or more articles? And, out of genuine curiosity, how do large clothing retailers typically design databases to be easily searchable by customers when items have so many identifying features?

Things I have tried or thought of:

The MongoDB website recommends a tree structure with child references (https://docs.mongodb.com/manual/tutorial/model-tree-structures-with-child-references/). I like this idea, but I read here (https://developer.mongodb.com/article/mongodb-schema-design-best-practices/) that when storing more than a few thousand pieces of data, using ObjectID references is no longer sufficient and could create issues because of data limits.

Further, each clothing item would fall into many different parts of the tree. For example, a blouse would be in the blouse "leaf" of the tree; if it's blue, it would also be in the blue "leaf"; and if it is sustainably sourced, it would fall into that "leaf" as well. Considering this, a tree-like data structure does not seem like the right way to go: it would store the same ObjectID in many different leaves.

My other idea was to store the article information (description, price, and picture) separately from the tagging/hierarchical information. Each tagging object would then have an ObjectID reference to the item. This way I could take advantage of Mongoose's populate method if I wanted to collect that information.

I also created part of the large tree structure as a proof of concept for a design idea I had. This is only for the front end right now, but it also makes for bad lookups, because they would look like taxonomy[0].options[0].options[0].options[0].title to get to 'Blouse', which from my classes doesn't seem like a good way to keep the code readable. Below is only a snippet of a long, branching object. I was going to try to make this a Mongoose schema, but it's a lot of work and I want to make sure that I do it well.

const taxonomy = [
    {
        title: 'Category',
        selected: false,
        options: [
            {
                title: 'top',
                selected: false,
                options: [
                    {
                        title: 'Shirt',
                        selected: false,
                        options: [
                            {
                                title: 'Blouse',
                                selected: false,
                            },
                            {
                                title: 'polo',
                                selected: false,
                            },
                            {
                                title: 'button down',
                                selected: false,
                            },
                        ],
                    },
                    {
                        title: 'T-Shirt',
                        selected: false,
                    },
                    {
                        title: 'Sweater',
                        selected: false,
                    },
                    {
                        title: 'Sweatshirt and hoodie',
                        selected: false,
                    },
                ],
            },
            // ... (remaining options omitted)
        ],
    },
    // ... (remaining groups omitted)
];
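
The deep index chain described above can be avoided with a small recursive lookup. This is only a minimal sketch, assuming every node has a title and an optional options array as in the snippet; findPath and sampleTaxonomy are hypothetical names, not part of the original code.

```javascript
// Hypothetical helper: depth-first search for a node by title (case-insensitive).
// Returns the titles from the root down to the match, or null if nothing matches.
function findPath(nodes, title, trail = []) {
  for (const node of nodes) {
    const path = [...trail, node.title];
    if (node.title.toLowerCase() === title.toLowerCase()) return path;
    const found = findPath(node.options || [], title, path);
    if (found) return found;
  }
  return null;
}

// Trimmed copy of the taxonomy above, closed off so it runs on its own.
const sampleTaxonomy = [
  {
    title: 'Category',
    options: [
      {
        title: 'top',
        options: [
          { title: 'Shirt', options: [{ title: 'Blouse' }, { title: 'polo' }] },
          { title: 'T-Shirt' },
        ],
      },
    ],
  },
];

const blousePath = findPath(sampleTaxonomy, 'blouse');
console.log(blousePath); // [ 'Category', 'top', 'Shirt', 'Blouse' ]
```

The lookup works for any depth, so the calling code never needs to know how many levels sit above a given title.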

Moving forward:

I am not looking for a perfect answer, but I am sure that someone has tackled this issue before (all big businesses that sell lots of categorized products have). If someone could just point me in the right direction, for example by giving me some terms to google, some articles to read, or some videos to watch, that would be great.

Thank you for any direction you can provide.


Solution

  • MongoDB is a document-based database. Each record in a collection is a document, and every document should be self-contained (it should contain all the information that you need).

    The best practice would be to create one collection for each logical whole that you can think of. This matters most when your documents hold a lot of data, because separate collections scale better than one deeply nested document.

    For example, you could create collections for Products, Subproducts, Categories, Items, Providers, Discounts, and so on.

    Then, when you create your schemas, instead of building a nested structure, you can store a reference to a document in one collection as a property of a document in another collection.

    NOTE: The maximum document size is 16 megabytes.

    BAD PRACTICE

    Let us first see what would be the bad practice. Consider this structure:

    Product = {
      "name": "Product_name",
      "sub_products": [{
          "sub_product_name": "Subproduct_name_1",
          "sub_product_description": "Description",
          "items": [{
              "item_name": "item_name_1",
              "item_description": "Description",
              "discounts": [{
                "discount_name": "Discount_1",
                "percentage": 25
              }]
            },
            {
              "item_name": "item_name_2",
              "item_description": "Description",
              "discounts": [{
                "discount_name": "Discount_1",
                "percentage": 25
              },
              {
                "discount_name": "Discount_2",
                "percentage": 50
              }]
            }
          ]
        },
        // ...
      ]
    }
    

    Here, the product document has a sub_products property, which is an array of sub-products. Each sub-product has items, and each item has discounts. Because every level is embedded, the document grows with each added sub-product, item, and discount, and it can eventually exceed the maximum document size.
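
To get a feel for how a fully nested document grows, its size can be estimated and compared with the 16 MB limit. The following is only a rough sketch: it measures JSON length, which is a stand-in for the real BSON size, and buildNestedProduct and approxBytes are hypothetical helpers invented for this illustration.

```javascript
// MongoDB's maximum BSON document size: 16 MiB.
const MAX_DOCUMENT_BYTES = 16 * 1024 * 1024;

// Rough size estimate. JSON length only approximates BSON size,
// so treat the result as an order-of-magnitude figure.
function approxBytes(doc) {
  return Buffer.byteLength(JSON.stringify(doc), 'utf8');
}

// Hypothetical generator for a product with fully embedded children.
function buildNestedProduct(subProductCount, itemsPerSubProduct, discountsPerItem) {
  const discounts = Array.from({ length: discountsPerItem }, (_, i) => ({
    discount_name: `Discount_${i + 1}`,
    percentage: 25,
  }));
  const items = Array.from({ length: itemsPerSubProduct }, (_, i) => ({
    item_name: `item_name_${i + 1}`,
    item_description: 'Description',
    discounts,
  }));
  const sub_products = Array.from({ length: subProductCount }, (_, i) => ({
    sub_product_name: `Subproduct_name_${i + 1}`,
    sub_product_description: 'Description',
    items,
  }));
  return { name: 'Product_name', sub_products };
}

const small = approxBytes(buildNestedProduct(2, 2, 2));
const large = approxBytes(buildNestedProduct(20, 20, 2));
const percentOfLimit = ((large / MAX_DOCUMENT_BYTES) * 100).toFixed(2);
console.log(`small: ${small} bytes, large: ${large} bytes (${percentOfLimit}% of the limit)`);
```

The size scales with the product of the counts at each level, so how soon the limit becomes a concern depends on how large each embedded array is allowed to grow.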

    GOOD PRACTICE

    Consider this structure:

    Product = {
      "name": "Product_name",
      "sub_products": [
         "sub_product_1_id",
         "sub_product_2_id",
         "sub_product_3_id",
         "sub_product_4_id",
         "sub_product_5_id",
         // ...
      ]
    }
    
    Subproduct = {
      "id": "sub_product_1_id",
      "sub_product_name": "Subproduct_name",
      "sub_product_description": "Description",
      "items": [
         "item_1_id",
         "item_2_id",
         "item_3_id",
         "item_4_id",
         "item_5_id",
         // ...
      ]
    }
    
    Item = {
      "id": "item_1_id",
      "item_name": "item_name_1",
      "item_description": "Description",
      "discounts": [
         "discount_1_id",
         "discount_2_id",
         "discount_3_id",
         "discount_4_id",
         "discount_5_id",
         // ...
      ]
    }
    
    Discount = {
      "id": "discount_1_id",
      "discount_name": "Discount_1",
      "percentage": 25
    }
    

    Now you have a collection for each logical whole, and you are just storing a reference to a document in one collection as a property of a document in another collection.
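
The move from the nested structure to separate collections can be sketched in plain JavaScript. This is an illustration under stated assumptions, not MongoDB or Mongoose code: the collections object, insert, insertDiscountOnce, and normalizeProduct are hypothetical names, and plain arrays stand in for real collections.

```javascript
// Plain arrays standing in for MongoDB collections (an assumption for illustration).
const collections = { products: [], subproducts: [], items: [], discounts: [] };
let nextId = 1;

// Stores a copy of `doc` in the named collection and returns its generated id.
function insert(collectionName, doc) {
  const id = `${collectionName}_${nextId++}`;
  collections[collectionName].push({ id, ...doc });
  return id;
}

// Discounts are shared between items, so each distinct name is stored only once.
const discountIdsByName = new Map();
function insertDiscountOnce(discount) {
  if (!discountIdsByName.has(discount.discount_name)) {
    discountIdsByName.set(discount.discount_name, insert('discounts', discount));
  }
  return discountIdsByName.get(discount.discount_name);
}

// Splits a nested product into one document per logical whole and
// replaces every embedded child with the id of the stored child.
function normalizeProduct(product) {
  const subProductIds = product.sub_products.map((sub) => {
    const itemIds = sub.items.map((item) => {
      const discountIds = item.discounts.map(insertDiscountOnce);
      return insert('items', { ...item, discounts: discountIds });
    });
    return insert('subproducts', { ...sub, items: itemIds });
  });
  return insert('products', { ...product, sub_products: subProductIds });
}

const nestedProduct = {
  name: 'Product_name',
  sub_products: [{
    sub_product_name: 'Subproduct_name_1',
    items: [
      { item_name: 'item_name_1', discounts: [{ discount_name: 'Discount_1', percentage: 25 }] },
      {
        item_name: 'item_name_2',
        discounts: [
          { discount_name: 'Discount_1', percentage: 25 },
          { discount_name: 'Discount_2', percentage: 50 },
        ],
      },
    ],
  }],
};

const productId = normalizeProduct(nestedProduct);
const storedProduct = collections.products.find((p) => p.id === productId);
console.log(storedProduct.sub_products.length, collections.items.length, collections.discounts.length);
// 1 2 2
```

Note that Discount_1 is stored once even though two items use it, which is the main saving over the embedded version.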

    You can now use one of Mongoose's best features, called population. If you store a reference to a document in one collection as a property of a document in another collection, then calling populate() on a query makes Mongoose replace the references with the actual documents.
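
Conceptually, population is an id-to-document lookup. The sketch below imitates it in plain JavaScript so that it runs without a database; it is not Mongoose's implementation, and db and populateField are hypothetical names. In Mongoose itself, the reference field is declared with a ref option in the schema and resolved by calling populate() on a query.

```javascript
// Maps keyed by id, standing in for MongoDB collections (an assumption for illustration).
const db = {
  products: new Map([
    ['product_1_id', {
      id: 'product_1_id',
      name: 'Product_name',
      sub_products: ['sub_product_1_id', 'sub_product_2_id'],
    }],
  ]),
  subproducts: new Map([
    ['sub_product_1_id', { id: 'sub_product_1_id', sub_product_name: 'Subproduct_name_1' }],
    ['sub_product_2_id', { id: 'sub_product_2_id', sub_product_name: 'Subproduct_name_2' }],
  ]),
};

// Returns a copy of `doc` in which the ids stored under `field` are replaced
// by the documents they point to in `collectionName`. The stored doc is untouched.
function populateField(doc, field, collectionName) {
  const resolved = doc[field].map((id) => db[collectionName].get(id));
  return { ...doc, [field]: resolved };
}

const stored = db.products.get('product_1_id');
const populated = populateField(stored, 'sub_products', 'subproducts');
console.log(populated.sub_products.map((s) => s.sub_product_name));
// [ 'Subproduct_name_1', 'Subproduct_name_2' ]
```

Each populated level costs an extra lookup, so it is usually worth populating only the fields a given view actually needs.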