Search code examples
database-designnosqlrdbmscapacity-planning

How can I calculate database design storage costs?


I often have a couple different schema's in mind when starting project. After making rough guesses I realize that some are less optimized for growth or storage space than others. Obviously, the size of the column value is the main thing. But table metadata, indexes, and row headers all play a part as well.

In addition, RDBMS use a completely different approach to data storage than object or key-value databases.

What are some good resources for trying to figure out the cost (or room needed) for database storage?

Note, my question has little to do with choosing the database, but rather knowing how to properly make use of each database's design for the most efficiently. Databases like PostgreSQL, MySQL, CouchDB, all have different target use cases and multiple ways to solve the same problem. So knowing the storage cost of each solution will help add to the choice of the best solution for the schema.


Solution

  • RDBMS use a completely different approach to data storage than object or key-value databases.

    The relational model assumes you don't know what data will be needed in the future, or how data will be accessed in the future. This has proven to be a pretty reliable assumption in my experience.

    That's one reason a SQL dbms will let you add indexes as they're needed, and let you drop indexes that have proven useless. It will let you add constraints as they become known--constraints that sometimes require adding more tables--and drop constraints as the requirements change. It will let you add columns as you discover more things that would be good to know. It will let you replace tables with views and replace views with tables. Some dbms will let you create materialized views--their impact on query speed can be dramatic, and their impact on disk usage, devastating.

    Useful databases extend their reach. A SQL database, designed according to the relational model, makes it relatively easy to add features nobody dreamed of during the initial design, and without crushing other parts of the system. So they're called often called upon to do things their initial designers didn't imagine.

    All of these things

    • adding and dropping indexes over time,
    • adding and dropping constraints over time,
    • adding and dropping columns over time,
    • adding and dropping tables over time,

    make any estimate of disk usage look like a waste of time. Any one of them alone can drastically change the disk space required for a database.

    You can calculate the space required by a row and a page fairly accurately. (Try Google for "YourDBMSname row layout" and "YourDBMSname page layout".) But when you try to multiply by the number of rows required you have to estimate the number of rows. That puts you at the big end of what Steve McConnell calls "the cone of uncertainty".

    If you haven't measured disk usage in multiple projects over time at your own company, estimating the impact of those bullet points above is just guessing.

    The last Fortune 100 company I worked for had an operational database that had been in production since the 1970s. Hundreds of applications, written in more than 25 programming languages over the course of 40 years hit that thing every day. (I think it was built on IBM's IMS originally; today it runs on Oracle.)

    Even just a few years ago, nobody there imagined that their database would be used to translate engineering drawings and bills of materials into Chinese, and also to produce the customs documents they'd need to get finished products out of China. Implementing those new features required storing additional data about every part and about every design document in their live inventory. Early in that project, our estimates were pretty far off. That's the big end of the cone. (We estimated several things, but not disk usage. We were required to succeed, so whatever design I came up with, somebody would be required to supply the needed disk space.) But when we went live, we knew the exact value for every estimate, because we'd already done the work. (That's the narrow end of the cone.)

    So, how do you mitigate the risk of guesswork in a database design and deployment environment? Take a lesson from 1972.

    Build a prototype, and measure it.

    Chemical engineers learned long ago that a process that works in the laboratory cannot be implemented in a factory in only one step. An intermediate step called the pilot plant is necessary to give experience in scaling quantities up and in operating in nonprotective environments. . . .

    . . . Project after project designs a set of algorithms and then plunges into construction of customer-deliverable software on a schedule that demands delivery of the first thing built. . . .

    The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. The only question is whether to plan in advance to build a throwaway, or to promise to deliver the throwaway to customers.

    Fred Brooks, Jr., in The Mythical Man-Month, p 116.