Search code examples
databaseruby-on-rails-3postgresqlrdbms

How can I get 20 users and 5 articles from each user?


Users can have many Articles.

I want to get 20 of the most recent users and for each user, 5 articles.

I was reading on http://wiki.postgresql.org/wiki/Find_recent_activity but this seems overly complicated for my scenario. However, the query mentioned in the article is stunningly fast! So I've been told. Perhaps there is a way to incorporate some of the methodologies used in the query?

I am using Postgres 9.2


Solution

  • Assuming that "most recent" user means "most recently created":

    Dummy data setup

    First, generate some dummy data. I haven't bothered prettily formatting this:

    create table users ( id serial primary key, username text not null,created_at timestamptz not null default current_timestamp );
    create table articles (id serial primary key, user_id integer not null references users(id), created_at timestamptz not null default current_timestamp);
    insert into users (username) values ('todd'),('bob'),('janet'),('joan'),('jim'),('jolly'),('roger'),('yarr'),('fred');
    update users set created_at = current_timestamp + id * INTERVAL '1' MINUTE;
    insert into articles(user_id, created_at) select u.id, x from users u cross join generate_series(current_timestamp, current_timestamp + INTERVAL '1' HOUR, INTERVAL '1' MINUTE) x;
    

    PostgreSQL 9.3beta1 and above: LATERAL

    Now, this is one of those times where it's a pity you're not on 9.3; there you could use a lateral subquery to do all this rather nicely:

    SELECT u.username, a.id AS article_id 
    FROM (
        SELECT u1.id, u1.username 
        FROM users u1 
        ORDER BY u1.created_at DESC LIMIT 5
    ) u,
    LATERAL (
        SELECT a1.id
        FROM articles a1 
        WHERE a1.user_id = u.id
        ORDER BY a1.created_at DESC LIMIT 5
    ) a;
    

    See: http://www.depesz.com/2012/08/19/waiting-for-9-3-implement-sql-standard-lateral-subqueries/

    However, since 9.3 isn't quite released yet, it's not surprising you aren't using it.

    PostgreSQL 9.2 and older:

    With 9.2 and older you have to use another layer of subqueries to work around the lack of LATERAL support you're in for some ugly workarounds involving the row_number window function and nested subqueries. See Grouped LIMIT in PostgreSQL: show the first N rows for each group?, http://www.postgresql.org/message-id/4CD0B077.2080700@ateb.com, http://troels.arvin.dk/db/rdbms/#select-top-n

    Something like:

    WITH
    last_five_users AS (
            SELECT u1.id, u1.username FROM users u1 ORDER BY u1.created_at DESC LIMIT 5
    )
    SELECT
            lfa.username, lfa.article_id
    FROM
    (
            SELECT lfive.username, lfive.id, a.id, row_number() OVER (PARTITION BY a.user_id ORDER BY created_at)
            FROM articles a
            INNER JOIN last_five_users lfive ON (a.user_id = lfive.id)
    ) AS lfa(username, user_id, article_id, rownum)
    WHERE lfa.rownum <= 10;
    

    (in this case for five users and 10 articles per user).

    Efficiency will be awful if the users selected have lots of articles, because it will fetch and number all articles for those users, not just the first n, then discard most of them in the outer query.

    If that's a problem you could create an SQL set returning function:

    CREATE OR REPLACE FUNCTION last_n_articles_for_user(user_id integer, n_articles integer)
    RETURNS SETOF articles AS $$
    SELECT * FROM articles WHERE articles.user_id = $1 ORDER BY created_at LIMIT $2
    $$ LANGUAGE sql STABLE;
    

    then use that in your main query:

    SELECT
      o.username,
      (o.last_articles).*
    FROM (
            SELECT
                    u.username,
                    last_n_articles_for_user(u.id, 10) AS last_articles
            FROM (
                    SELECT *
                    FROM users u1
                    ORDER BY u1.created_at DESC
                    LIMIT 5
            ) u
    ) AS o;
    

    This is likely to perform a great deal better where there's an index on created_at, though not as well as the LATERAL approach enabled in 9.3. The extra layer of subquery is required because invoking last_n_articles_for_user directly with the .* syntax will - due to an internal PostgreSQL limitation on row-returning functions - cause it to be invoked once for each column.