mysql ruby-on-rails postgresql heroku pg

Converting MySQL "get top y for each n" GROUP BY query to PostgreSQL

Ok so I'm going from mysql to postgresql because of Heroku. I have this statement that works perfect in mysql, but I can't seem to figure out the best way / any way for PG.

m = C.select("m.derp, c.id").joins("join m on m.c_id = c.id").where("m.id IN (SELECT max(id) from m group by c_id order by updated_at desc)").order("m.updated_at desc")

So basically I need to grab m's as grouped by c, ordered by the latest created m's. Postgres won't let me select multiple columns without them being in the group by call, which will give me more rows than I want. Otherwise I'll get the

column X must appear in the GROUP BY clause or be used in an aggregate function

error.

Any ideas?

Solution

This is a common issue on migration to Pg, because MySQL permits you to write non-standard GROUP BY clauses that may have non-deterministic results.

If you:

SELECT a, b FROM sometable GROUP BY a;

and there's >1 distinct value for b in the table, how does the database know which row to return?

For PostgreSQL, if you want this kind of behaviour you must use the DISTINCT ON extension instead, as PostgreSQL follows the standard by rejecting possibly non-deterministic GROUP BY clauses. Something like:

SELECT DISTINCT ON (a) a, b FROM sometable ORDER BY a;

... assuming that this is really what you intended. Your question doesn't really cover the underlying problem you're trying to solve - why you're trying to do this - so it's hard to say if this is the right approach.

Update: Based on the explanation in the comments, in your case I think you can use a correlated subquery with ORDER BY ... LIMIT 1 and an equality test, instead of your current uncorrelated subquery with IN.

Translating your Rails-code to SQL, I get something like:

SELECT m.derp, c.id
FROM c
INNER JOIN m ON m.c_id = c.id
WHERE m.id IN (SELECT max(id) from m group by c_id order by updated_at desc)
ORDER BY m.updated_at desc;

which you explain as being intended to find the newest message on each conversation.

If so, instead try:

SELECT m.derp, c.id
FROM c
INNER JOIN m ON m.c_id = c.id
WHERE m.id = (SELECT m2.id FROM m m2 WHERE m2.c_id = c.id ORDER BY m2.updated_at LIMIT 1)
ORDER BY m.updated_at desc;

The subquery runs once for each outer c row, so it's vital that m.c_id have an index. In fact, I think it'll perform best with an index on

m (c_id, updated_at)

or if you're on PostgreSQL 9.2 or above with index-only scan support, possibly:

m (c_id, updated_at, id)

though I'd need to test both those with dummy data and schema to be 100% sure.