Search code examples
sqlpostgresqlaggregategreatest-n-per-groupstring-parsing

Get group maxima from combined strings


I have a table with a column code containing multiple pieces of data like this:

001/2017/TT/000001
001/2017/TT/000002
001/2017/TN/000003
001/2017/TN/000001
001/2017/TN/000002
001/2016/TT/000001
001/2016/TT/000002
001/2016/TT/000001
002/2016/TT/000002

There are 4 items in 001/2016/TT/000001: 001, 2016, TT and 000001.
How can I extract the max for every group formed by the first 3 items? The result I want is this:

001/2017/TT/000003
001/2017/TN/000002
001/2016/TT/000002
002/2016/TT/000002

Edit

  • The subfield separator is /, and the length of subfields can vary.
  • I use PostgreSQL 9.3.

Solution

  • Obviously, you should normalize the table and split the combined string into 4 columns with proper data type. The function split_part() is the tool of choice if the separator '/' is constant in your string and the length of can vary.

    CREATE TABLE tbl_better AS 
    SELECT split_part(code, '/', 1)::int AS col_1  -- better names?
         , split_part(code, '/', 2)::int AS col_2
         , split_part(code, '/', 3)      AS col_3  -- text?
         , split_part(code, '/', 4)::int AS col_4
    FROM   tbl_bad
    ORDER  BY 1,2,3,4  -- optionally cluster data.
    

    Then the task is trivial:

    SELECT col_1, col_2, col_3, max(col_4) AS max_nr
    FROM   tbl_better
    GROUP  BY 1, 2, 3;
    

    Related:

    Of course, you can do it on the fly, too. For varying subfield length you could use substring() with a regular expression like this:

    SELECT max(substring(code, '([^/]*)$')) AS max_nr
    FROM   tbl_bad
    GROUP  BY substring(code, '^(.*)/');
    

    Related (with basic explanation for regexp pattern):

    Or to get only the complete string as result:

    SELECT DISTINCT ON (substring(code, '^(.*)/'))
           code
    FROM   tbl_bad
    ORDER  BY substring(code, '^(.*)/'), code DESC;
    

    About DISTINCT ON:

    Be aware that data items cast to a suitable type may behave differently from their string representation. The max of 900001 and 1000001 is 900001 for text and 1000001 for integer ...