amazon-s3, storage, distributed-computing, distributed-system

Need for metadata store while storing an object


While checking out the design of a service like pastebin, I noticed the usage of two different storage systems:

  1. An object store (such as Amazon S3) for storing the actual "paste" data
  2. A metadata store for other things pertaining to that "paste" data, such as the URL hash (used to access the paste data), a reference to the actual paste data, etc.

I am trying to understand the need for this metadata store.

Is this generally the recommended way? What specific advantages do we get from using a separate metadata store?

Do object storage systems NOT allow metadata to be stored along with the actual object in the same storage server?


Solution

  • Object storage systems generally do allow quite a lot of metadata to be attached to the object.

    But then your metadata is at the mercy of the object store.

    • Your metadata search is limited to what the object store allows.
    • Analysis, notification (à la inotify), etc. are limited to what the object store allows.
    • If you wanted to move from S3 to Google Cloud Storage, or to use both, you'd have to normalize your metadata.
    • Your metadata size is capped by the object store's limits.
    • You can't have metadata that spans object stores (e.g. a link that refers to multiple pieces of paste data).
    • You might not be able to have binary metadata.
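As a rough illustration of the points above, here is a minimal sketch of a metadata store kept outside the object store. It assumes an SQLite table as the metadata store; the table name `paste_meta`, its columns, and the sample rows are all hypothetical and not taken from any real pastebin design. The idea is only that queries such as "which pastes have expired?" work the same way regardless of which object store holds the data.

```python
import sqlite3

# Hypothetical schema: every name here is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE paste_meta (
        url_hash   TEXT PRIMARY KEY,
        backend    TEXT NOT NULL,   -- which object store holds the data, e.g. 's3' or 'gcs'
        object_key TEXT NOT NULL,   -- key of the object inside that store
        owner      TEXT,
        expires_at INTEGER          -- Unix timestamp, NULL means never
    )
""")

# Made-up sample rows spanning two object stores.
rows = [
    ("a1b2c3", "s3", "pastes/0001", "alice", 1000),
    ("d4e5f6", "gcs", "pastes/0002", "bob", 5000),
    ("g7h8i9", "s3", "pastes/0003", "alice", None),
]
conn.executemany("INSERT INTO paste_meta VALUES (?, ?, ?, ?, ?)", rows)


def expired_pastes(now):
    """Return (url_hash, backend, object_key) for pastes expired at `now`."""
    cur = conn.execute(
        "SELECT url_hash, backend, object_key FROM paste_meta "
        "WHERE expires_at IS NOT NULL AND expires_at <= ? "
        "ORDER BY url_hash",
        (now,),
    )
    return cur.fetchall()


def pastes_by_owner(owner):
    """Return the URL hashes of all pastes owned by `owner`, in any backend."""
    cur = conn.execute(
        "SELECT url_hash FROM paste_meta WHERE owner = ? ORDER BY url_hash",
        (owner,),
    )
    return [row[0] for row in cur.fetchall()]
```

Because the `backend` column is just another field, moving a paste from one object store to another would only mean updating one row, not reshaping the metadata.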

    Typically, metadata is both very important to the business and very heavily used by it, so it has different usage characteristics from the data itself. It therefore makes sense to put it on storage with different characteristics.

    I can't find anywhere how pastebin.com makes money, so I don't know how heavily they use metadata. But even the basic lookup, the translation from URL to paste data, is not something you can do securely with object storage alone.
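The lookup mentioned above could be sketched roughly as follows. This is an assumption-laden illustration, not how pastebin.com works: two in-memory dictionaries stand in for the object store and the metadata store, and the function names `create_paste` and `read_paste` are hypothetical. The point is that the public URL hash is a random token resolved through the metadata store, so the object store's keys are never exposed to clients.

```python
import secrets

# Stand-ins for the two systems (assumptions for illustration only).
object_store = {}    # object key -> raw paste bytes (plays the role of S3)
metadata_store = {}  # URL hash -> metadata record (plays the role of a database)


def create_paste(content: bytes) -> str:
    """Store the paste and return the public URL hash that refers to it."""
    # Internal key in the object store; never shown to clients.
    object_key = f"pastes/{secrets.token_hex(16)}"
    object_store[object_key] = content

    # Unguessable public token; regenerate on the unlikely collision.
    url_hash = secrets.token_urlsafe(6)
    while url_hash in metadata_store:
        url_hash = secrets.token_urlsafe(6)

    metadata_store[url_hash] = {"object_key": object_key, "size": len(content)}
    return url_hash


def read_paste(url_hash: str):
    """Translate a URL hash to the paste data, or None if it is unknown."""
    record = metadata_store.get(url_hash)
    if record is None:
        return None
    return object_store[record["object_key"]]
```

A request for an unknown hash stops at the metadata store and never touches the object store, which is also a natural place to add checks such as expiry or access control.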