mongodb postgresql cassandra nosql distributed-system

NoSQL (Cassandra/Mongo) vs RDBMS

Hi, I'm learning about the features of NoSQL databases from a system design perspective, and I've also read that a lot of big companies use a sharded RDBMS instead of those NoSQL databases to store their data.

Does this mean that the only advantage of NoSQL databases (Cassandra/MongoDB) is that they're off-the-shelf distributed solutions and cheap to maintain?

Solution

read that a lot of big companies use a sharded RDBMS

In my last job, my org supported a manually sharded Postgres solution. It was a source of tremendous pain for us, as it was difficult to manage and maintain due to its size. Remember that RDBMSs weren't really designed to work that way.

only advantage of nosql ... is because it's an off-the shelf distributed solution

The decision is really all about tradeoffs. When your data workloads cannot be handled by a single DB instance, or you require uptime without a single point of failure, NoSQL can help you. Databases that sacrifice consistency for partition tolerance and availability ("AP" databases) can often process large workloads with low latency because they spread the data (and thus the queries) across multiple server instances.

Also, if your data needs to be geographically or data center aware, you'll want a database which supports that. Trying to make database products work in ways that their original design did not account for is a recipe for pain.

cheap to maintain

Ask anyone who runs an enterprise database organization, and they will tell you that NoSQL is not cheap (or easy) to maintain. Sure, you might be getting an open source product that you don't have to "buy," but you're going to need (often highly paid) database engineers to maintain it.

scalability comes along with nosql by design, but it also seems that sql can also achieve the same and the primary issue is only the maintenance/configuration when scaling it up.

It depends. How big are you planning to scale to, and how many data replicas would need to be supported? Large enterprises like Apple have thousands of servers running Apache Cassandra. They do that because iCloud needs to scale to support the needs of 900 million iPhone users. They can easily add nodes (scale up) or remove them (scale down) based on their needs for compute resources.

Achieving that level of scalability with a relational database requires a LOT more work than it does with Cassandra (NoSQL). And when you find out that you need to scale up even more, you're basically looking at a data reload scenario to get the data to the new instances, because changing the number of shards changes which shard each row belongs to. A database team will reach the point (very quickly) where the amount of work it takes to scale out an RDBMS is impractical.
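
To make the "data reload" point concrete, here is a minimal sketch in Python. It assumes the simplest manual sharding scheme, where a row lives on shard hash(key) % number_of_shards. That scheme, the key format, and the function names are assumptions for illustration only; real setups vary.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Place a key on a shard using a stable hash modulo the shard count (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


# Hypothetical keys; any evenly hashed key set behaves similarly.
keys = [f"user-{i}" for i in range(10_000)]
moved = fraction_moved(keys, 4, 5)
print(f"{moved:.0%} of keys change shards when going from 4 to 5 shards")
```

With this kind of placement, going from 4 to 5 shards relocates roughly four out of five keys, which is why adding a single instance tends to turn into a bulk data migration. Databases like Cassandra avoid most of that movement by using a token ring (a form of consistent hashing), so adding a node only shifts a portion of the data.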

The other aspect is: how does the client application know which of the relational database servers to send the query to? For most relational databases, you'll end up having to build out or augment that logic layer somehow. And when the number of database instances changes, you'll need the application to know about that, too. NoSQL databases account for node discovery, and most abstract it so the client application doesn't need to worry about it.
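
As a rough idea of what that logic layer looks like, here is a hedged sketch in Python. The ShardRouter class, its method names, and the host names are all hypothetical, and it does not open real connections; it only shows that the application has to own the list of shards and be told whenever that list changes.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class ShardRouter:
    """Hypothetical application-side router for a manually sharded RDBMS."""

    shard_dsns: List[str]  # one connection string per shard, maintained by hand

    def dsn_for(self, customer_id: str) -> str:
        """Pick the shard that should hold this customer's rows."""
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shard_dsns)
        return self.shard_dsns[index]

    def update_shards(self, new_dsns: List[str]) -> None:
        """Must be called on every app instance whenever shards are added or removed."""
        self.shard_dsns = list(new_dsns)


# Made-up host names, for illustration only.
router = ShardRouter([
    "postgresql://db-shard-0.example.internal/app",
    "postgresql://db-shard-1.example.internal/app",
    "postgresql://db-shard-2.example.internal/app",
])
target = router.dsn_for("customer-42")
print(f"customer-42 is routed to {target}")
```

Every query path in the application has to go through something like dsn_for, and every deployment has to stay in sync with the current shard list. With Cassandra, by comparison, the driver discovers the cluster's nodes from a few contact points, so this bookkeeping does not live in your application code.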

Also remember that not all NoSQL databases are created equal. On some products, only certain nodes will accept writes. On others, any node can handle a read or a write. Relational databases don't have any concept of that, so you would have to account for it as well.

tl;dr;

It's much more complicated than only maintenance due to scaling. If it weren't, every major relational database would have a simple way to handle that, and NoSQL DBs would be irrelevant. But that hasn't happened.