Marco Arment, the programmer who developed Tumblr has a blog post where he says that a database join can be avoided by (e.g.) doing the following two queries:
$user_ids = $following->query_return_column_array(
'SELECT user_id FROM ?table WHERE following_id = ?i', $this->id
);
$followed_users = $user->find(
'SELECT * FROM ?table WHERE id IN ?ai', $user_ids
);
I know some SQL, but for some reason I'm unable to understand the exact point he is making here.
My Question: can you please break down the exact point the code is making? Thanks.
Post-"Correct Answer" Note to self:
following
is a 2 column table which contains user_id
which is anyone's user id, and following_id
which is a column of people who are being followed. Thus, the table maps people being followed and their followers.
user_ids
is a vector of the ids of all the users who follow a particular user (e.g Taylor Swift).
followed_users
IMHO is a misnomer. I would have named it followers_full_details
or following_users
. followed_users
is a subset of the rows of the user
table containing all the details of the followers of a particular person (Taylor Swift).
user
is a table that contains all the details of each user of the social network
Of course.
When you do a JOIN
over two large tables, the JOIN
might turn out to be very slow, so, if one table has n records and the other has m records, then the JOIN is doing n * m JOIN
comparisons. Let's suppose that n = m = 100 000 000. In this case n * m = 100 000 000 * 100 000 000 = 10 000 000 000 000 000.
So, if you know that in essence you will have a relatively small amount of records to join, you can search your small number of records from n records and some other small number of records from m records, let's assume that you will end up with 100 records for the first query and 10 records for the second. This would mean that you query from n records initially, then from m records and then from 100 * 10 (either a third query or some logic to parse the result, it does not matter at this point). So the magnitude is of
100 000 000 + 100 000 000 + 100 * 10 = 200 000 000 + 1000 = 200 001 000.
Now, let's compare our two magnitudes:
10 000 000 000 000 000
vs.
200 001 000
We can see that there are scenarios when large joins are suboptimal in comparison to pre-selecting the "interesting" records from the tables you would otherwise join and then parse them.
Of course, joins can largely be improved by indexing fields or field sets, for example, but there are some scenarios that are largely unimprovable. So, whenever you have a JOIN
or some other multidimensional query that is underperforming, a possible way of optimization is to pre-select the component tables and do the join on the results, as long as the results are relatively small.
Another scenario when you might want to avoid joins is when you have the same records on one dimension used for multiple joins. In this case it might make sense to pre-select the repetitive records and remove a dimension from the other searches by substituting a reference to a table with a whitelist set.
EDIT
Let's break down the actual code given in the question as well:
$user_ids = $following->query_return_column_array(
'SELECT user_id FROM ?table WHERE following_id = ?i', $this->id
);
$followed_users = $user->find(
'SELECT * FROM ?table WHERE id IN ?ai', $user_ids
);
The example scenario has user_id
and following_id
as fields in the first queries. This is a scenario often seen at social networks, such as Facebook, X, Quora, etc. where users have the ability to follow each-other, i.e., a user will be notified about the actions made by the users they follow.
The scenario here has the first query searching the users a certain user follows. The certain user who is the follower of the users we search for has the id
of $this->id
, which is being passed as a query parameter to the value with which matching records would have an equal following_id
value, that is, we are searching for users who are followed by the user whose id is $this->id
. From these records, or, in other words: from the people followed by the current user we are interested to know their ids. This is why user_id
is being selected and ultimately stored into $user_ids
.
The next query is searching the attributes of these followers.
So, we have a following
table which has a user_id
and a following_id
field, each record representing that the user_id
stored there is following following_id
. Assuming that we search for users followed by user 123, this is equivalent to
SELECT user_id
FROM following
WHERE following_id = 123;
if there are not too many people followed by this user, then we will have reasonably few results. Let's assume that the ids of 198423, 12345, 56258 and 438397 are the user_id values of interest here. Then, we can run
SELECT *
FROM users
WHERE id IN (198423, 12345, 56258, 438397);
Assuming that there are 100 000 000 users and 500 000 000 followings, assuming that a user follows 5 users on average, joining 100 000 000 x 500 000 000 records might end up to be extremely suboptimal in comparison to finding 4 or 5 ids from 500 000 000 followings and then search by those ids among 100 000 000 users.
It's 100 000 000 x 500 000 000 vs. 100 000 000 + 500 000 000
user_ids
: a collection of ids of the users followed by a specific user in a social networkfollowing
: a table where tuples of (user_id, following_id)
are stored, where user_id
is the id
of the following user and following_id
is the id
of the followed userfollowing_id
: the id
of the followed userfollowed_users
: users followed by the user whose id is $this->id
ai
: It's a placeholder for the $user_ids
parameter, most probably meaning array of integers, but nevertheless, what's important for you is that it will evaluate to (value1, value2, ...)