sql-server performance t-sql distinct partitioning

T-SQL Efficient use of PARTITION BY and DISTINCT

I have the following table to monitor user login to an application:

CREATE TABLE [dbo].[userActivity](
    [userType] [nchar](10) NULL,
    [userInstanceID] [nchar](10) NULL,
    [userID] [nchar](10) NULL,
    [login] [datetime2](7) NULL
) ON [PRIMARY]
GO

A particularity of my data is that a unique user is determined by the combination of userType, userInstanceID, and userID.

For instance, in the following image, I have three distinct users:

customer 1 1 (red)
employee 1 2 (blue)
customer 2 1 (green)

My objective is to know:

The number of times each user has login
The most recent login
The oldest login

I have some test data as:

INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'1         ', N'1         ', CAST(N'2020-09-17T18:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'employee  ', N'1         ', N'2         ', CAST(N'2020-09-18T09:00:07.2494560' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'1         ', N'1         ', CAST(N'2020-08-17T03:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'2         ', N'1         ', CAST(N'2020-07-23T10:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'2         ', N'1         ', CAST(N'2020-10-25T11:00:07.2492412' AS DateTime2))
GO

I was able to get what I need with the following:

SELECT DISTINCT userType, userInstanceID, userID, numberOfLogins, MostRecentLogin, oldestLogin FROM (
    SELECT userType, userInstanceID, userID, 
        COUNT(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS numberOfLogins,
        max(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS MostRecentLogin,
        min(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS oldestLogin
        FROM dbo.userActivity) AS summary

My question is: is this and efficient method? I have millions of rows and around 20 columns to bring along for each user.

I appreciate any suggestion.

Thanks!

Solution

The first "smell" with what you wrote is that your PARTITION BY columns in each case are a) the same and b) the only non-aggregated columns in the SELECT list¹.

The second "smell" is DISTINCT. No, really. It's far too often used when someone says "Well, I'm getting the results I need, except I'm getting multiple rows when I only wanted one". The bad approach is to apply DISTINCT and to not think about why you were getting those multiple results.

In your case you're getting multiple results because you're not aggregating correctly.

Looking back at your problem, you're saying "for each unique combination of these columns I want to computed these aggregates". Which is pretty well the definition of GROUP BY². So yes, the straightforward way of writing this query is:

select userType, userInstanceID, userID,
       COUNT(*) as numLogins, MIN(login) as firstLogin, MAX(login) as lastLogin
from dbo.userActivity
group by userType, userInstanceID, userID

You'll notice this is shorter and uses fewer features, which is usually a way to tell you've got the query into a form most likely to be well optimized by the optimizer.

¹Taken together, these mean that you're going to potentially compute exactly the same result rows multiple times. Whether you actually do depends on a) whether you have multiple rows with the same unique combinations and b) how smart the optimizer is being.

²I'd suggest strongly that any time you're contemplating DISTINCT you ought to be thinking GROUP BY. DISTINCT is effectively GROUP BY *³, but it's rare to not have aggregates when grouping.

³Except the * being "all columns in the SELECT clause" rather than "all columns produced by the FROM/JOINs".