Search code examples
sql-serverperformancet-sqldistinctpartitioning

T-SQL Efficient use of PARTITION BY and DISTINCT


I have the following table to monitor user login to an application:

CREATE TABLE [dbo].[userActivity](
    [userType] [nchar](10) NULL,
    [userInstanceID] [nchar](10) NULL,
    [userID] [nchar](10) NULL,
    [login] [datetime2](7) NULL
) ON [PRIMARY]
GO

A particularity of my data is that a unique user is determined by the combination of userType, userInstanceID, and userID.

For instance, in the following image, I have three distinct users:

  1. customer 1 1 (red)
  2. employee 1 2 (blue)
  3. customer 2 1 (green)

enter image description here

My objective is to know:

  1. The number of times each user has login
  2. The most recent login
  3. The oldest login

I have some test data as:

INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'1         ', N'1         ', CAST(N'2020-09-17T18:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'employee  ', N'1         ', N'2         ', CAST(N'2020-09-18T09:00:07.2494560' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'1         ', N'1         ', CAST(N'2020-08-17T03:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'2         ', N'1         ', CAST(N'2020-07-23T10:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer  ', N'2         ', N'1         ', CAST(N'2020-10-25T11:00:07.2492412' AS DateTime2))
GO

I was able to get what I need with the following:

SELECT DISTINCT userType, userInstanceID, userID, numberOfLogins, MostRecentLogin, oldestLogin FROM (
    SELECT userType, userInstanceID, userID, 
        COUNT(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS numberOfLogins,
        max(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS MostRecentLogin,
        min(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS oldestLogin
        FROM dbo.userActivity) AS summary

enter image description here

My question is: is this and efficient method? I have millions of rows and around 20 columns to bring along for each user.

I appreciate any suggestion.

Thanks!


Solution

  • The first "smell" with what you wrote is that your PARTITION BY columns in each case are a) the same and b) the only non-aggregated columns in the SELECT list1.

    The second "smell" is DISTINCT. No, really. It's far too often used when someone says "Well, I'm getting the results I need, except I'm getting multiple rows when I only wanted one". The bad approach is to apply DISTINCT and to not think about why you were getting those multiple results.

    In your case you're getting multiple results because you're not aggregating correctly.

    Looking back at your problem, you're saying "for each unique combination of these columns I want to computed these aggregates". Which is pretty well the definition of GROUP BY2. So yes, the straightforward way of writing this query is:

    select userType, userInstanceID, userID,
           COUNT(*) as numLogins, MIN(login) as firstLogin, MAX(login) as lastLogin
    from dbo.userActivity
    group by userType, userInstanceID, userID
    

    You'll notice this is shorter and uses fewer features, which is usually a way to tell you've got the query into a form most likely to be well optimized by the optimizer.


    1Taken together, these mean that you're going to potentially compute exactly the same result rows multiple times. Whether you actually do depends on a) whether you have multiple rows with the same unique combinations and b) how smart the optimizer is being.

    2I'd suggest strongly that any time you're contemplating DISTINCT you ought to be thinking GROUP BY. DISTINCT is effectively GROUP BY *3, but it's rare to not have aggregates when grouping.

    3Except the * being "all columns in the SELECT clause" rather than "all columns produced by the FROM/JOINs".