mysql ruby-on-rails ruby-on-rails-4 activerecord database-performance

Rails: Performance issue with joining of records

I have the following setup with ActiveRecord and MySQL:

User has many groups through memberships
Group has many users through memberships

There is also an index by group_id and user_id described in schema.rb:

add_index "memberships", ["group_id", "user_id"], name: "uugj_index", using: :btree

3 different queries:

User.where(id: Membership.uniq.pluck(:user_id))

(3.8ms) SELECT DISTINCT memberships.user_id FROM memberships User Load (11.0ms) SELECT users.* FROM users WHERE users.id IN (1, 2...)

User.where(id: Membership.uniq.select(:user_id))

User Load (15.2ms) SELECT users.* FROM users WHERE users.id IN (SELECT DISTINCT memberships.user_id FROM memberships)

User.uniq.joins(:memberships)

User Load (135.1ms) SELECT DISTINCT users.* FROM users INNER JOIN memberships ON memberships.user_id = users.id

What is the best approach for doing this? Why the query with join is much slower?

Solution

The first query is bad because it sucks all of the user ids into a Ruby array and then sends them back to the database. If you have a lot of users, that's a huge array and a huge amount of bandwidth, plus 2 roundtrips to the database instead of one. Furthermore, the database has no way to efficiently handle that huge array.

The second and third approaches are both efficient database-driven solutions (one is a subquery, and one is a join), but you need to have the proper index. You need an index on the memberships table on user_id.

add_index :memberships, :user_id

The index that you already have, would only be helpful if you wanted to find all of the users that belong to a particular group.

Update:

If you have a lot of columns and data in your users table, the DISTINCT users.* in the 3rd query is going to be fairly slow because MySQL has to compare a lot of data in order to ensure uniqueness.

To be clear: this is not intrinsic slowness with JOIN, it's slowness with DISTINCT. For example: Here is a way to avoid the DISTINCT and still use a JOIN:

SELECT users.* FROM users
INNER JOIN (SELECT DISTINCT memberships.user_id FROM memberships) AS user_ids
ON user_ids.user_id = users.id;

Given all of that, in this case, I believe the 2nd query is going to be the best approach for you. The 2nd query should be even faster than reported in your original results if you add the above index. Please retry the second approach, if you haven't done so yet since adding the index.

Although the 1st query has some slowness issues of its own, from your comment, it's clear that it is still faster than the 3rd query (at least, for your particular dataset). The trade-offs of these approaches is going to depend on your particular dataset in regards to how many users you have and how many memberships you have. Generally speaking, I believe the 1st approach is still the worst even if it ends up being faster.

Also, please note that the index I'm recommending is particularly designed for the three queries you listed in your question. If you have other kinds of queries against these tables, you may be better served by additional indexes, or possibly multi-column indexes, as @tata mentioned in his/her answer.