Tags: sql, aggregate-functions, greatest-n-per-group, amazon-athena, presto

SQL syntax: greatest-n-per-group + aggregation in Athena


I've spent hours on this so far. I'm using AWS Athena and am not getting any further; I think there is something I am missing.

I have a table like so:

------------------------------------------------------------------
caseid | postcode | streetname | state | dateandtime             
-----------------------------------------------------------------
123123 | 4000     | arthur     | QLD   | 2018-09-30 10:32:51.000 
------------------------------------------------------------------

This table will have multiple rows with the same caseid, and for each caseid I want the latest row by dateandtime. I have figured out I can do the following:

SELECT b.caseid, MAX(b.dateandtime) as dateandtime
FROM  jsonmanual b
GROUP BY b.caseid

Which works how I want it to.

Now I need to filter these results with a BETWEEN condition on dateandtime and get a count of the postcode/streetname/state from these unique entries, which I have not been able to do. Below is my current best guess, which tries to show a count of the postcodes between two timestamps:

SELECT a.postcode, count(a.postcode) as countof
FROM  jsonmanual a
INNER JOIN (
    SELECT distinct b.caseid, MAX(b.dateandtime) as dateandtime, b.postcode
    FROM  jsonmanual b
    GROUP BY b.caseid, b.postcode
) b ON a.caseid = b.caseid and a.postcode = b.postcode
where dateandtime between TIMESTAMP '2016-05-05 09:51:00' and TIMESTAMP '2020-01-10 15:36:00'
group by a.postcode

Any help would be greatly appreciated. As you can probably tell, I am not much of a SQL guy, but I am aiming to get better :-)

SQLFiddle: http://www.sqlfiddle.com/#!9/2f4fbd/1

My ideal output

--------------------
|postcode | countof |
|-------------------|
|1166     | 1       |
|1231     | 1       |
|2171     | 1       |
|3651     | 1       |
|4469     | 1       |
|4697     | 2       |
--------------------

Solution

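  • Amazon Athena supports window functions, so you can use the ROW_NUMBER window function: number the rows within each caseId ordered by dateExact descending, then keep only the rows whose row number is 1. That leaves the latest row per caseId.

    Next, use COUNT with GROUP BY postcode on those rows.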

    Schema (MySQL v8.0)

    CREATE TABLE cases
        (`country` varchar(3), `vetClinic` varchar(11), `ageMonths` int, `vaxStatus` varchar(11), `patientId` long, `ageWeeks` int, `methodDiag` varchar(8), `dateExact` varchar(19), `vetName` varchar(14), `streetName` varchar(5), `caseNumber` int, `caseId` varchar(36), `dataOrigin` varchar(10), `datePresented` varchar(19), `state` varchar(3), `vaxDate` varchar(19), `cognitoSubNumber` varchar(36), `dateAndTime` varchar(19), `streetNumber` int, `postcode` int, `clinicalSigns` varchar(8), `caseOutCome` varchar(7), `isOpen` varchar(4), `ageYears` int, `species` varchar(8), `suburb` varchar(13), `vaxBrand` varchar(7))
    ;
    
    INSERT INTO cases
        (`country`, `vetClinic`, `ageMonths`, `vaxStatus`, `patientId`, `ageWeeks`, `methodDiag`, `dateExact`, `vetName`, `streetName`, `caseNumber`, `caseId`, `dataOrigin`, `datePresented`, `state`, `vaxDate`, `cognitoSubNumber`, `dateAndTime`, `streetNumber`, `postcode`, `clinicalSigns`, `caseOutCome`, `isOpen`, `ageYears`, `species`, `suburb`, `vaxBrand`)
    VALUES
        ('AUS', 'whoopwhoop', 9, 'vaxinated', 9839815985, 9, 'vomiting', '2019-05-05 09:54:26', 'adam de mamp', 'ann', 3, '2edd7dd0-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2019-08-19 06:50:59', 'SA', '2019-04-02 19:52:07', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 126, 3651, 'hat', 'alive', 'True', 9, 'pug', 'carindale', 'digimon'),
        ('AUS', 'whoopwhoop', 9, 'vaxinated', 9839815985, 9, 'vomiting', '2019-05-05 09:52:26', 'adam de mamp', 'buts', 3, '2edd7dd0-c49c-11e8-b678-a5dc64edc7ee', 'poops', '2019-08-19 06:50:59', 'SA', '2019-04-02 19:52:07', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 126, 3651, 'hat', 'alive', 'True', 9, 'pug', 'carindale', 'digimon'),
        ('AUS', 'whoopwhoop', 9, 'vaxinated', 9839815985, 9, 'rash', '2019-05-05 09:51:26', 'adam de mamp', 'ann', 3, '2ecb7c70-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2019-08-19 06:50:59', 'SA', '2019-04-02 19:52:07', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 126, 3651, 'hat', 'alive', 'True', 9, 'pug', 'carindale', 'digimon'),
        ('AUS', 'rbh', 9, 'vaxinated', 2114598894, 4, 'blood', '2019-01-10 15:36:29', 'adam de mamp', 'queen', 2, '2ed78a60-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2018-09-30 19:28:34', 'WA', '2019-01-19 03:38:28', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 39, 1166, 'hat', 'ongoing', 'True', 1, 'pitbull', 'carindale', 'digimon'),
        ('AUS', 'rbh', 9, 'unvaxinated', 9606793080, 46, 'blood', '2018-11-01 16:18:51', 'sumo man', 'annie', 1, '2edabeb0-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2018-10-14 16:21:43', 'ACT', '2018-12-10 03:36:49', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 59, 1231, 'bad', 'ongoing', 'True', 12, 'aligator', 'fendalton', 'digimon'),
        ('AUS', 'rbh', 12, 'unvaxinated', 2406607356, 47, 'blood', '2018-12-18 05:36:22', 'adam de mamp', 'annie', 3, '2eddf300-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2019-05-12 22:21:49', 'TA', '2019-03-15 17:28:35', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 180, 2171, 'hat', 'dead', 'True', 7, 'staffy', 'brisbane city', 'digimon'),
        ('AUS', 'examplevet', 2, 'vaxinated', 2449508561, 4, 'rash', '2018-12-07 15:36:05', 'anders holmvic', 'annie', 3, '2ed196f0-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2019-04-12 04:31:22', 'WA', '2019-02-13 17:09:51', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 10, 4450, 'fateigue', 'alive', 'True', 14, 'aligator', 'spring hill', 'varex'),
        ('AUS', 'rural', 6, 'vaxinated', 3900464429, 33, 'rash', '2019-09-24 15:03:15', 'adam de mamp', 'queen', 2, '2ed47d20-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2019-06-02 20:01:12', 'NSW', '2019-02-19 10:10:35', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 129, 4697, 'fateigue', 'dead', 'True', 15, 'staffy', 'balanora', 'suplex'),
        ('AUS', 'Vets are us', 9, 'unvaxinated', 8871302949, 1, 'vomiting', '2019-03-29 09:17:00', 'Lucy foxtrot', 'annie', 1, '2edd7dd0-c49c-11e8-b678-a5dc64edc7ee', 'ParvoAlert', '2018-11-21 08:51:38', 'SA', '2019-02-04 06:05:07', 'c70c64ad-d1d0-40be-86e6-a96de1b8de8b', '2018-09-30 10:32:51', 67, 4469, 'hat', 'dead', 'True', 13, 'aligator', 'carindale', 'digimon')
    ;
    

    Query #1

    SELECT postcode, COUNT(*)
    FROM (
      SELECT t1.*, ROW_NUMBER() OVER (PARTITION BY caseId ORDER BY dateExact DESC) AS rn
      FROM cases t1
    ) t1
    WHERE rn = 1
    GROUP BY postcode;
    
    | postcode | COUNT(*) |
    | -------- | -------- |
    | 3651     | 2        |
    | 4450     | 1        |
    | 4697     | 1        |
    | 1166     | 1        |
    | 1231     | 1        |
    | 2171     | 1        |
    

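
    For comparison, the join-based attempt from the question can probably be made to work as well. Three things stand in its way as written: the subquery groups by both caseid and postcode, so it returns the latest row per (caseid, postcode) pair rather than per caseid; the join never matches on dateandtime, so older rows of each case survive; and the unqualified `dateandtime` in the WHERE clause is ambiguous between `a` and `b`. A hedged sketch with those points changed, again assuming the `jsonmanual` table and a timestamp `dateandtime` column from the question:

    ```sql
    -- Sketch only: join each row to the latest timestamp of its caseid.
    SELECT a.postcode, COUNT(*) AS countof
    FROM jsonmanual a
    INNER JOIN (
        -- one row per caseid, carrying that case's latest timestamp
        SELECT caseid, MAX(dateandtime) AS dateandtime
        FROM jsonmanual
        GROUP BY caseid
    ) b
        ON  a.caseid = b.caseid
        AND a.dateandtime = b.dateandtime   -- keep only the latest row
    WHERE a.dateandtime BETWEEN TIMESTAMP '2016-05-05 09:51:00'
                            AND TIMESTAMP '2020-01-10 15:36:00'
    GROUP BY a.postcode;
    ```

    One difference from the ROW_NUMBER version: if two rows of the same caseid share the same latest dateandtime, this join keeps both and counts the case twice, whereas ROW_NUMBER keeps exactly one of them (chosen arbitrarily among the ties).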