Search code examples
databasejoin

two-query substitution to avoid a database join


Marco Arment, the programmer who developed Tumblr has a blog post where he says that a database join can be avoided by (e.g.) doing the following two queries:

$user_ids = $following->query_return_column_array(
    'SELECT user_id FROM ?table WHERE following_id = ?i', $this->id
);

$followed_users = $user->find(
    'SELECT * FROM ?table WHERE id IN ?ai', $user_ids
);

I know some SQL, but for some reason I'm unable to understand the exact point he is making here.

My Question: can you please break down the exact point the code is making? Thanks.


Post-"Correct Answer" Note to self:

  • following is a 2 column table which contains user_id which is anyone's user id, and following_id which is a column of people who are being followed. Thus, the table maps people being followed and their followers.

  • user_ids is a vector of the ids of all the users who follow a particular user (e.g Taylor Swift).

  • followed_users IMHO is a misnomer. I would have named it followers_full_details or following_users. followed_users is a subset of the rows of the user table containing all the details of the followers of a particular person (Taylor Swift).

  • user is a table that contains all the details of each user of the social network


Solution

  • Of course.

    When you do a JOIN over two large tables, the JOIN might turn out to be very slow, so, if one table has n records and the other has m records, then the JOIN is doing n * m JOIN comparisons. Let's suppose that n = m = 100 000 000. In this case n * m = 100 000 000 * 100 000 000 = 10 000 000 000 000 000.

    So, if you know that in essence you will have a relatively small amount of records to join, you can search your small number of records from n records and some other small number of records from m records, let's assume that you will end up with 100 records for the first query and 10 records for the second. This would mean that you query from n records initially, then from m records and then from 100 * 10 (either a third query or some logic to parse the result, it does not matter at this point). So the magnitude is of

    100 000 000 + 100 000 000 + 100 * 10 = 200 000 000 + 1000 = 200 001 000.

    Now, let's compare our two magnitudes:

    10 000 000 000 000 000

    vs.

    200 001 000

    We can see that there are scenarios when large joins are suboptimal in comparison to pre-selecting the "interesting" records from the tables you would otherwise join and then parse them.

    Of course, joins can largely be improved by indexing fields or field sets, for example, but there are some scenarios that are largely unimprovable. So, whenever you have a JOIN or some other multidimensional query that is underperforming, a possible way of optimization is to pre-select the component tables and do the join on the results, as long as the results are relatively small.

    Another scenario when you might want to avoid joins is when you have the same records on one dimension used for multiple joins. In this case it might make sense to pre-select the repetitive records and remove a dimension from the other searches by substituting a reference to a table with a whitelist set.

    EDIT

    Let's break down the actual code given in the question as well:

    $user_ids = $following->query_return_column_array(
        'SELECT user_id FROM ?table WHERE following_id = ?i', $this->id
    );
    
    $followed_users = $user->find(
        'SELECT * FROM ?table WHERE id IN ?ai', $user_ids
    );
    

    The example scenario has user_id and following_id as fields in the first queries. This is a scenario often seen at social networks, such as Facebook, X, Quora, etc. where users have the ability to follow each-other, i.e., a user will be notified about the actions made by the users they follow.

    The scenario here has the first query searching the users a certain user follows. The certain user who is the follower of the users we search for has the id of $this->id, which is being passed as a query parameter to the value with which matching records would have an equal following_id value, that is, we are searching for users who are followed by the user whose id is $this->id. From these records, or, in other words: from the people followed by the current user we are interested to know their ids. This is why user_id is being selected and ultimately stored into $user_ids.

    The next query is searching the attributes of these followers.

    So, we have a following table which has a user_id and a following_id field, each record representing that the user_id stored there is following following_id. Assuming that we search for users followed by user 123, this is equivalent to

    SELECT user_id
    FROM following
    WHERE following_id = 123;
    

    if there are not too many people followed by this user, then we will have reasonably few results. Let's assume that the ids of 198423, 12345, 56258 and 438397 are the user_id values of interest here. Then, we can run

    SELECT *
    FROM users
    WHERE id IN (198423, 12345, 56258, 438397);
    

    Assuming that there are 100 000 000 users and 500 000 000 followings, assuming that a user follows 5 users on average, joining 100 000 000 x 500 000 000 records might end up to be extremely suboptimal in comparison to finding 4 or 5 ids from 500 000 000 followings and then search by those ids among 100 000 000 users.

    It's 100 000 000 x 500 000 000 vs. 100 000 000 + 500 000 000

    • user_ids: a collection of ids of the users followed by a specific user in a social network
    • following: a table where tuples of (user_id, following_id) are stored, where user_id is the id of the following user and following_id is the id of the followed user
    • following_id: the id of the followed user
    • followed_users: users followed by the user whose id is $this->id
    • ai: It's a placeholder for the $user_ids parameter, most probably meaning array of integers, but nevertheless, what's important for you is that it will evaluate to (value1, value2, ...)