I am new to sharding and wanted to know what implications sharding has for various queries. For the sample data set named "people":
person_id | person_fname | person_lname | person_dob
----------------------------------------------------
1 | John | Smith | 1972-03-04
2 | Sally | Jones | 1968-09-14
3 | Phil | Forrester | 1976-11-25
4 | Gwen | Langley | 1955-04-20
5 | Pedro | Romero | 1962-12-21
6 | Gene | Halford | 1978-01-11
7 | Juan | Peza | 1977-08-07
8 | Pierre | Henry | 1980-04-30
The data is sharded equally across four nodes by creating a hash of the surrogate identity "id". However, you need to perform read and write operations on records that potentially span all the nodes such as:
SELECT person_fname,
person_lname
FROM people
WHERE person_dob > '1970-01-01'
Or say you had a further table of "orders", which references "people" on the "person_id" column, and wanted to perform a join...
SELECT order_id,
order_amount,
order_date,
person_fname,
person_lname
FROM orders
LEFT JOIN people
WHERE order_amount > 50
Is it the case that in effect all of the nodes will run the query in parallel? I am assuming that each server will have less work to do for each step as instead of one instance running the query over eight records, simultaneously, four instances will run the query over two(ish) records, with the further benefit that if the DBMS is able to perform shard selection then the other nodes need not continue executing any further instructions, is this assumption correct?
Are there any known performance implications with sharding and complex joins (beyond that of this simple example)?
It will indeed allow that to be done in parallel.
It can indeed make joins complex, and hence slower, if they have to cross different shards.
However, with a many-to-one, if you had e.g. orders
sharded in such a way that all rows in the orders
table where in the same shard as the related row in the people
table, then this cross-shard problem doesn't happen.
You need to design your sharding approach so you get lots of cases like that and few (ideally none) where you end up crossing shards.
You also want to have your shard on the key you actually seek on the most. Eg. if you are finding people by username as your starting point to everything else, then you want to shard by username, not id, because when then finding them you already know which single shard to hit, rather than having to hit all of them just to get back zero rows from most.