MYISAM sharding vs using InnoDB

I have a table with very high insert rate and update rate as well as read rate. On average there are about 100 rows being inserted and updated per second. And there are about 1000 selects per second.

The table has about 100 million tuples. It is a relationship table so it only has about 5 fields. Three fields contain keys so they are indexed. All the fields are of integers.

I am thinking of sharding the data, however, it adds a lot of complexity, but does offer speed. The other alternative is to use innodb.

The database runs on a raid 1 of 256GB ssd with 32GB 1600mhz of RAM and i7 3770k over clocked to 4Ghz

The database freezes constantly at peak times where the queries can be as high as 200 rows being inserted or updated and 2500 selects per second

Could you guys please point into as what I should do?

Solution

Sharding is usually a good idea to distribute table size. Load problems should generally be addressed with a replicated data environment. In your case your problem is a) huge table and b) table level locking and c) crappy hardware.

InnoDB

If you can use one of the keys on your table as a primary key, InnoDB might be a good way to go since he'll let you do row-level locking which may reduce your queries from waiting on each other. A good test might be to replicate your table to a test server and try all your queries against him and see what the performance benefit is. InnoDB has a higher resource consumption rates then MyISAM, so keep that in mind.

Hardware

I'm sorry bud, but your hardware is crap for the performance you need. Twitter does 34 writes per second at 2.6k QPS. You can't be doing Twitter's volume and think a beefed up gaming desktop is going to cut it. Buy a $15k Dell with some SSD drives and you'll be able to burst 100k QPS. You're in the big times now. It's time to ditch the start-up gear and get yourself a nice server. You do not want to shard. It will be cheaper to upgrade your hardware, and frankly, you need to.

Sharding

Sharding is awesome for splitting up large tables. And that's it.

Let me be clear about the bad. Developing a sharded architecture sucks. You want to do everything possible to not shard. Upgrade hardware, buy multiple servers and set up replication, optimize your code, but for the love of God, do not shard. You are way below the performance line for sharding. When your pushing sustained 30k+ QPS, then we can talk sharding. Until that day, NO.

You can buy a medium-range server ($30k Dell PowerEdge) with 5TB of Fusion IO on 16 cores and 256 GB of RAM and he'll take you all the way to 200k QPS.

But if you refuse to listen to me and are going to shard anyway, then here's what you need to do.

Rule 1: Stay on the Same Shard (ie. Picking a Partition Rule)

Once you shard, you do not want to be accessing data from across multiple shards. You need to pick a partition rule that keeps your query on the same shard as much as possible. Distributing a query (rule 4) is incredibly painful in distributed data environments.

Rule 2: Build a Shard Map and Replicate it

Your code will need to be able to get to all shards. Create a shard map based on your partition rule that lets your code know where to go to get the data he wants.

Rule 3: Write a Query Wrapper for your Shards

You do not want to manually decide which shard to go to. Write a wrapper that does it for you. You will thank yourself down the road when you're writing code.

Rule 4: Auto-balance

You'll eventually need to balance your shards to keep performance optimal. Plan for this before-hand and write your code with the intention that you'll have some kron job which balances your shards for you.

Rule 4: Support Distributed Queries

You inevitably will need to break Rule 1. When that happens, you'll need a query wrapper that can pull data from multiple shards and aggregate (bring) it into one place. The more shards you have, the more likely this will need to be multi-threaded. In my shop, we call this a distributed query (ie. a query which runs on multiple shards).

Bad News: There is no code out there for doing distributed queries and aggregating results. Apache Hadoop tries, but he's terrible. So is HiveDB. A good query distributor is hard to architect, hard to write, hard to optimize. This is a problem billion-dollar a year companies deal with. I shit you not, but if you come up with a good wrapper for distributing queries across shards that supports sorting+limit clauses and scales well, you could be a millionaire over night. Selling it for $300,000? You would have a line outside your door a mile long.

My point here is sharding is hard and it is expensive. It takes a lot of work and you want to do everything humanly possible to not shard. If you must, follow the rules.