Search code examples
mysqlsqlmaxgreatest-n-per-group

SQL max value in one column


I'm trying to get records that have the highest value in one particular column ("version"). I'm using the base_id to get rows, and there may be more than one row with the same base_id, but they will then have different version numbers. So the point of the statement is to only get the one with the highest version. And the statement below works, but only if there are actually more than one value. If there is only one I get no records at all back (as opposed to the expected one row). So how can I get only the value with the highest version number below, even if for some records only one version exists?:

SELECT r.id
     , r.title
     , u.name created_by
     , m.name modified_by
     , r.version
     , r.version_displayname
     , r.informationtype
     , r.filetype
     , r.base_id
     , r.resource_id
     , r.created
     , r.modified
     , GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';') categories 
  FROM resource r 
  JOIN category_resource cr 
    ON r.id = cr.resource_id 
  JOIN category c 
    ON cr.category_id = c.id 
  JOIN user u 
    ON r.created_by = u.id 
  JOIN user m 
    ON r.modified_by = m.id 
 WHERE r.base_id = 'uuid_033a7198-a213-11e3-93de-2b47e5a489c2' 
   AND r.version = (SELECT MAX(r.version) FROM resource r) 
 GROUP 
    BY r.id;

EDIT:

I realize the other parts of the query itself may complicate things, so I'll try to create a cleaner example, which should show what I'm after, I hope.

If I do this:

SELECT id, title, MAX(version) AS 'version' FROM resource GROUP BY title

on a table that looks like this:

enter image description here

Then I get the following results:

enter image description here

which is not correct, as you can see from the table. I.e, it's fetched the highest value for each resource, but if you look at Introduction, e.g. the resource with the value 2 for version has the id 6, whereas the one fetched has the id 1. So the query seems to somehow combine the values from different rows...?

I should note that I'm very much a novice at SQL, and the original query that I exemplified the problem with was something I got help with here, so please do explain as clearly as possible, thanks.

Another note is that I found some suggestion of a subquery, but apart from not returning the correct results either, it was really slow. I'm testing on 5000 rows and I really need to expect it to take only a fraction of a second, in order to meet performance requirements.

EDIT 2:

Found a way to incorporate a statement, sort of like one of the suggested ones, as well as the various solutions here: Retrieving the last record in each group

However, I tried them all, and even though most seem to work, they are incredibly slow…

Take this one:

SELECT
  r.id, r.title,
  u.name AS 'created_by', m.name AS 'modified_by', r.version, r.version_displayname, r.informationtype,
r.filetype, r.base_id, r.resource_id, r.created, r.modified,
  GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';') AS 'Categories'
FROM
  resource r
  INNER JOIN
  (SELECT
   DISTINCT r.id AS id
  FROM
   resource r
  INNER JOIN
   category_resource cr1 ON (r.id = cr1.resource_id)
  WHERE
   cr1.category_id IN (9)
) mr

    ON r.id = mr.id
  INNER JOIN category_resource cr
    ON r.id = cr.resource_id
  INNER JOIN category c
    ON cr.category_id = c.id
  INNER JOIN user u
    ON r.created_by = u.id
  INNER JOIN user m
    ON r.modified_by = m.id
INNER JOIN
(
    SELECT max(version) MyVersion, base_id
    FROM resource
    GROUP BY base_id
) r2
  ON r.base_id = r2.base_id
  AND r.version = r2.MyVersion
group by r.base_id
order by r.version desc;

The addition at the end (starting with the INNER JOIN) to get only the rows with the highest version value for each base_id slows the query down from 20 ms to around 6-8 seconds. That is a no go… But this surprises me. Although I’m obviously no database expert, it seems to me that database queries should be optimized for getting data like this. But if I do the only alternative I can think of, which is to get all the records regardless of version number, and then filter them in PHP, guess what? That is much faster than this…

I initially thought the performance hit caused by filtering in PHP was too much, but that is about a second’s delay, so still much better than this.

But I feel like I’m missing something, shouldn’t it be possible to do this much more efficiently?


Solution

  • Well, I think I found the answer myself. As far as I can understand, a query like this will take a lot of time, and instead the database needs to be modified. I found this:

    How to version control a record in a database

    The suggestion to use startend and enddate columns and set the enddate to null for the latest version made it very easy to do queries for the latest version. And it is again very very fast. So this is what I needed. It gives me something like this, all put together:

    SELECT
      r.id, r.title,
      u.name AS 'created_by', m.name AS 'modified_by', r.version, r.version_displayname, r.informationtype,
    r.filetype, r.base_id, r.resource_id, r.created, r.modified,
      GROUP_CONCAT( CONCAT(CAST(c.id as CHAR),',',c.name,',',c.value) separator ';') AS 'categories', startdate, enddate
    FROM
      resource r
      INNER JOIN 
      (SELECT
       DISTINCT r.id AS id
      FROM
       resource r
      INNER JOIN
       category_resource cr1 ON (r.id = cr1.resource_id)
      WHERE
       cr1.category_id IN (9)
    ) mr
    
        ON r.id = mr.id
      INNER JOIN category_resource cr
        ON r.id = cr.resource_id
      INNER JOIN category c
        ON cr.category_id = c.id
      INNER JOIN user u
        ON r.created_by = u.id
      INNER JOIN user m
        ON r.modified_by = m.id
    WHERE r.enddate is null
    group by r.id;
    

    And this query once again is back to the 20 ms execution time.