Search code examples
sqlms-accessgroup-byaggregatewindow-functions

SUM OVER with GROUP BY


I am working on a large database with millions of rows and I am trying to be efficient in my queries. The database contains regular snapshots of a loan portfolio where sometimes loans default (status goes from '1' to <>'1'). When they do, they appear only once in the corresponding snapshot, then they are no longer reported. I am trying to get a cumulative count of such loans - as they develop over time and divided into many buckets depending on country of origin, vintage, etc. SUM (...) OVER seems to be a very efficient function to achieve the result but when I run the following query

Select 
assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status, 
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType, 
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal, 
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(aa27) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(aa26) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22>='2014-01' and aa22<='2014-12' and vintage='2015' and active=0 and aa74<>'1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate

I get

SQL Error (8120) column aa27 is invalid in the selected list because it is not contained in either an aggregate function or the GROUP BY clause

Can anyone shed some light? Thanks


Solution

  • This is what I found to be working, comparing my results with some external research data. I have simplified the fields for readability:

        select 
          poolcutoffdate, 
          count(1) as LoanCount,
          MAX(sum(case status when 'default' then 1 else 0 end)) 
          over (order by poolcutoffdate 
                ROWS between unbounded preceding AND CURRENT ROW) as CumulDefaults
    
    from myDatabase
    group by poolcutoffdate
    order by poolcutoffdate asc
    

    I am thus counting all loans that have been in the 'default' status at least once from inception to the current cutoff date.

    Note the use of MAX(SUM()) so that the result is the largest of the various iteration from the first to the current row. Using SUM(SUM()) would add the various iterations leading to a cumulative of cumulatives.

    I considered using SUM(SUM()) with "PARTITION BY poolcutoffdate" so that the tally restarts from 0 and does not add from the previous cutoff date but this would only include loans from the latest cutoff so if a loan had defaulted and removed from the pool it would wrongly not be counted.

    Note the CASE in the OVER statement.

    Thanks for all the help