Search code examples
postgresqlgreatest-n-per-group

Postgresql finding max transaction_id for each type giving duplicates (when it's not supposed to for PK)


Question as title; So I have a code as shown below to find the ID with highest amount transacted by type of card

SELECT tr.identifier, cc.type, tr.amount as max_amount
FROM credit_cards cc, transactions tr 
WHERE (tr.amount, cc.type) IN (SELECT MAX(tr.amount), cc.type   
                FROM credit_cards cc, transactions tr 
                WHERE cc.number = tr.number
                 GROUP BY cc.type)
GROUP BY tr.identifier, cc.type;

When I run the code, I get duplicate transaction_identifier which shouldn't happen since it's the PK of the transactions table; output when I run above code is shown below

ID --------Card type--------------- Max amount
2196    "diners-club-carte-blanche" 1000.62
2196        "visa"                  1000.62
11141   "mastercard"                1000.54
2378    "mastercard"                1000.54

e.g. 2196 in above exists for diners carte-blanche not visa; 'mastercard' is correct since 2 different IDs can have same max transaction.

However, this code should run because it is possible for 2 different id to have the same max amount for each type.

Does anyone know how to prevent the duplicates from occurring?

is this due to the WHERE ... IN clause which matches either the max amount or the card type? (the ones with duplicate is Visa and Diners-Carte-Blanche which both have same max value of 1000.62 so I think that's where they're matching wrong)


Solution

  • TL/DR: add WHERE cc.number = tr.number to the outer query.

    Long version

    When you query FROM table_1, table_2 in the outer query and don't connect the tables (via a join or where clause) the result is a cartesian product, meaning EVERY row from table_1 is joined to EVERY row from table_2. This is the same as a CROSS JOIN.

    So while your inner query has a where clause and (correctly) returns the max for each credit card type... your outer query does not, and so all possible combinations of credit card and transaction are being compared to the maximums, not just the valid ones.

    For example, if cc has rows three rows (mastercard, visa, amex) and tr has three rows (1,2,3) selecting "from cc, tr" is resulting in nine rows:

    mastercard,1
    mastercard,2
    mastercard,3
    visa,1
    visa,2
    visa,3
    amex,1
    amex,2
    amex,3
    

    where what you want is:

    mastercard,1
    visa,3
    amex,2
    

    Each row in the first table will be repeated for each row in the second. Then the WHERE (...) IN (...) restrict this set of rows to only those that match a row in the inner query. As you can imagine, this can easily lead to duplicate results. Some of those duplicates are being removed by the outer GROUP BY, which should not be necessary once this issue is fixed.

    As a general rule, I never use join [table_1], [table_2] and prefer to ALWAYS be explicit about doing an inner or outer join (or, in some situations, a cross join) to help avoid this kind of issue and make it clearer to the reader.

    SELECT tr.identifier, cc.type, tr.amount as max_amount
    FROM credit_cards cc INNER JOIN transactions tr ON (cc.number = tr.number)
    WHERE (tr.amount, cc.type) IN (
      SELECT MAX(tr.amount), cc.type   
      FROM credit_cards cc 
      INNER JOIN transactions tr ON (cc.number = tr.number)
      GROUP BY cc.type
    )
    
    

    NOTE: In the case of a tie, this will give you every transaction for each credit card type that is tied for the maximum amount.