I'm finishing a web application and preparing for the possibility of 10 million active users each year. I'm looking at one table in MySQL that will have hundreds of rows per user. I'm a programmer, not a DBA, and thought duplicating that table structure 50 times could work: each user would be assigned a permanent table determined by their user_id. Those tables would not JOIN or connect with each other, but they would JOIN with a few other tables. This is only to manage the very large number of rows expected. Is this a valid design, or am I missing something? While researching I have seen partitioning mentioned. Should I be looking into that? The illustration below shows what I have in mind. Thank you.
User
=======
1  John
2  Adam
3  Jane

TableNum = ceiling( user_id )

Table1   ...   Table50
=======        =======
Plan A: Single server
A single server can handle 10M users with a total of 1B rows of data. I will give you these tips:
CREATE TABLE users (
    user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name ...
    etc...
) ENGINE=InnoDB;

CREATE TABLE the_billion_row_table (
    user_id INT UNSIGNED NOT NULL,
    id BIGINT AUTO_INCREMENT,
    etc...
    PRIMARY KEY (user_id, id),  -- to 'cluster' rows for a user
    INDEX(id)                   -- to keep AUTO_INCREMENT happy
    etc...
) ENGINE=InnoDB;
This slightly odd combination of keys (a composite PRIMARY KEY plus a secondary index on id) helps optimize all the queries that include
WHERE user_id = ? AND ...
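For example, a query of the following shape (the specific columns and constant are hypothetical) reads only a contiguous slice of the clustered index, because all of a user's rows are stored together:

-- Hypothetical example: fetch one user's 20 newest rows.
-- The (user_id, id) primary key keeps these rows physically adjacent.
SELECT *
FROM   the_billion_row_table
WHERE  user_id = 12345
ORDER  BY id DESC
LIMIT  20;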
Splitting the data into 50 tables is not beneficial; do not do it. Ditto for PARTITION.
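For reference only, a PARTITIONed version of the same table is sketched below (an illustrative definition, not a recommendation): because the (user_id, id) primary key already clusters each user's rows together, hash-partitioning by user_id adds administrative overhead without making the WHERE user_id = ? queries any faster.

-- Sketch of what partitioning would look like; not recommended here.
CREATE TABLE the_billion_row_table (
    user_id INT UNSIGNED NOT NULL,
    id BIGINT AUTO_INCREMENT,
    PRIMARY KEY (user_id, id),  -- user_id must appear in every unique key of a partitioned table
    INDEX(id)
) ENGINE=InnoDB
  PARTITION BY HASH(user_id) PARTITIONS 50;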
Plan B: "Sharding" across multiple servers
When the traffic is too much for a single server, and you have done other optimizations (such as in Plan A), "sharding" is likely to be the solution.
This is where a subset of the users lives on each of several servers. (You are unlikely to need 50, but I will use that number for discussion.)
To find where a given user is, there must be some kind of "proxy". This can be a third-party proxy server (free or commercial), a homegrown one, or routing logic built into the client code. This is similar to your "pick which of 50 tables" scheme, but it is now "pick which of 50 servers".
Sharding makes it very difficult to do queries across multiple users. If you need that, we need to start over in the discussion.
Since one user may grow and another may shrink, there is a "balancing" problem. This can be solved by moving users between shards (lots of hand-crafted code) or by not filling shards to capacity and instead preemptively spinning up new shards.
Mapping user_id to shard number can be done in several ways. A purely "algorithmic" (modulo or hash) mapping exacerbates the "balancing" problem, because the formula fixes where each user lives and cannot relocate an individual user. A "dictionary" approach requires some form of lookup to discover which server holds a user's data when they log in. Both are sketched below.
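A minimal sketch in SQL, assuming 50 shards; the user_shard_map table name and its columns are hypothetical placeholders:

-- Algorithmic mapping: the shard is a fixed function of user_id.
-- Cheap to compute, but rebalancing means changing the formula and moving data.
SELECT 12345 MOD 50 AS shard_num;

-- Dictionary mapping: a small lookup table (held centrally or cached by the
-- proxy) records where each user lives and can be updated if a user is moved.
CREATE TABLE user_shard_map (
    user_id  INT UNSIGNED NOT NULL PRIMARY KEY,
    shard_id SMALLINT UNSIGNED NOT NULL
) ENGINE=InnoDB;

SELECT shard_id
FROM   user_shard_map
WHERE  user_id = 12345;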
Further discussion of sharding is beyond the scope of this Q&A.