I'm a self taught programmer and I've always followed certain design parameters that were based more on common sense than research when it comes to building systems that scale. However, I just realized one component of my system might not be necessary.
Generally speaking I break user data into groups and assign it to specific mysql servers. When a content server behind a load balancer receives a request, I use data from the request (like a userid) to resolved the database where that users data is stored by querying a central table stored on DynamoDB which can handle an insane amount of load.
However, I also assign the user data to databases within the server. Like I'll have a 100 databases in each server that all have the same table structure, and I'll assign 250 users to each database.
The logic originally was that a table where each user has 2k entries is going to run way faster with 500k entries than 50 million. However, it occurred to me that breaking up user data this way might not make any sense at all. Indexes are pretty efficient. I'm sure the database actually had some kind of internal logic that allows it to access data at basically the same speed right? I've been doing this for ten years, and I just realized this might not be necessary at all. Any thoughts? Can I just make one database with all my tables in it or should I continue doing things the way I always have, sharding across 100 databases on a server?
This is a little theoretical, so it might be worth understanding the idea of Big-O complexity aka Time Complexity.
A clustered B-Tree index lookup for a single item is O(log(n)) where n is the number of rows in the table. DynamoDB is a hash-based implementation, which puts it much closer to O(1), meaning that it's performance does not appreciably change with content size.
Now for the math, log(500k) = 5.7, where log(50mil) = 7.7 Single-row lookups scale REALLY well, as long as you are avoiding hits to the disk to load the index into memory.
So, you are talking about a 25% difference for a single-row lookup. Which is significant, but still likely less than the overhead of a round-trip to another db system (like DynamoDB).
Of course, your mileage may vary, as there are concerns like keeping the index in memory, etc... So it's possible that you would see a difference in a production environment. I highly recommend setting up a test, and verify your performance.