Search code examples
sql-servert-sqlwindow-functions

TSQL Window Function Best Practices


I've been using window functions more and more in my daily queries, and have been wondering if I am doing it correctly.

Let's say we have a database, dbo.songs which contains one record per song, with the following columns: artist, songName, and releaseDate.
For each artist, I want to select their first songName and releaseDate, ordered by releaseDate ascending. Please note that the decision to group on artist is arbitrary - tomorrow, I may need to group by a different column (BPM, album, length).

To do this, we have a few options:

Recently, I've been using the "bunch of identically scoped window functions" strategy, which would look something like this:

SELECT DISTINCT
    s.artist
    , FIRST_VALUE(s.songName) OVER (PARTITION BY s.artist ORDER BY s.releaseDate ASC) AS songName
    , FIRST_VALUE(s.releaseDate) OVER (PARTITION BY s.artist ORDER BY s.releaseDate ASC) AS releaseDate
FROM dbo.songs s

That seems sortof sloppy, doesn't it? It relies completely on the DISTINCT to avoid a million duplicate rows, and if you want to select additional fields (BPM, album, length) you need a bunch more window functions, which I believe would count as RBAR.

Option two is "figure out the keys and then join to self", which would look like this:

WITH earliestArtistRelease AS (
    SELECT
        s.artist
        , MIN(s.releaseDate) AS releaseDate
    FROM dbo.songs s
    GROUP BY s.artist
)

SELECT
    e.artist
    , e.releaseDate
    , s.songName
FROM dbo.songs s
INNER JOIN earliestArtistRelease e
ON s.releaseDate = e.releaseDate
    AND s.artist = e.artist

That gets the job done, but it doesn't seem all that efficient - especially if we don't have an index on releaseDate and artist. We also run into issues if an artist released two songs on one day. Additionally, if we're doing some funky prioritization (select the song release on 2018-01-01 if possible, otherwise the earliest released song), we can't do it nearly as easily as we could with the window functions: OVER (PARTITION BY s.artist ORDER BY IIF(s.releaseDate = '20180101', '19000101', s,releaseDate)), which is sort of hacky, but concise.

We have other options: self-CROSS APPLY, using ROW_NUMBER(), but from what I can tell these tend to be either less efficient, or less concise than the "bunch of identically scoped window functions" strategy outlined above.

So, my question: what is the best practice? How would you go about handling this, to both save processor cycles, as well as avoid doubling the length of the codebase? Is one option better inside a CTE, and another better for inserting into a temp table?

Any links to existing standards, papers or resources greatly appreciated.


Solution

  • 1) You should get your distinct artists first. If you already have an artist table then select from that. If you don't then create an artist table and have the songs table relate to that with a foreign key.

    2) Once you've done that then CROSS APPLY would be the proper operator to retrieve the related song data.

    SELECT a.artist, t.songName, t.releaseDate
    FROM artists a
    CROSS APPLY (
        SELECT TOP 1 s.songName, s.releaseDate
        FROM songs s
        WHERE s.artistId = a.artistId
        -- any other "funky" prioritization.
        ORDER BY s.releaseDate ASC
    ) topSongs t