I have the following table to monitor user login to an application:
CREATE TABLE [dbo].[userActivity](
[userType] [nchar](10) NULL,
[userInstanceID] [nchar](10) NULL,
[userID] [nchar](10) NULL,
[login] [datetime2](7) NULL
) ON [PRIMARY]
GO
A particularity of my data is that a unique user is determined by the combination of userType, userInstanceID, and userID.
For instance, in the following image, I have three distinct users:
My objective is to know, for each unique user: the number of logins, the most recent login, and the oldest login.
I have some test data as:
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer ', N'1 ', N'1 ', CAST(N'2020-09-17T18:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'employee ', N'1 ', N'2 ', CAST(N'2020-09-18T09:00:07.2494560' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer ', N'1 ', N'1 ', CAST(N'2020-08-17T03:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer ', N'2 ', N'1 ', CAST(N'2020-07-23T10:00:07.2492412' AS DateTime2))
GO
INSERT [dbo].[userActivity] ([userType], [userInstanceID], [userID], [login]) VALUES (N'customer ', N'2 ', N'1 ', CAST(N'2020-10-25T11:00:07.2492412' AS DateTime2))
GO
I was able to get what I need with the following:
SELECT DISTINCT userType, userInstanceID, userID, numberOfLogins, MostRecentLogin, oldestLogin FROM (
SELECT userType, userInstanceID, userID,
COUNT(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS numberOfLogins,
max(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS MostRecentLogin,
min(login) OVER(PARTITION BY userType, userInstanceID, userID ORDER BY userType, userInstanceID, userID) AS oldestLogin
FROM dbo.userActivity) AS summary
My question is: is this an efficient method? I have millions of rows, and around 20 columns to bring along for each user.
I appreciate any suggestion.
Thanks!
The first "smell" with what you wrote is that your PARTITION BY columns in each case are a) the same and b) the only non-aggregated columns in the SELECT list¹.
The second "smell" is DISTINCT. No, really. It's far too often used when someone says "Well, I'm getting the results I need, except I'm getting multiple rows when I only wanted one". The bad approach is to apply DISTINCT and not think about why you were getting those multiple results. In your case, you're getting multiple results because you're not aggregating correctly.
Looking back at your problem, you're saying "for each unique combination of these columns I want to compute these aggregates". Which is pretty well the definition of GROUP BY². So yes, the straightforward way of writing this query is:
SELECT userType, userInstanceID, userID,
       COUNT(*) AS numLogins, MIN(login) AS firstLogin, MAX(login) AS lastLogin
FROM dbo.userActivity
GROUP BY userType, userInstanceID, userID
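For reference, run against the five sample rows above, this returns one row per distinct user, along these lines (dates truncated here for readability):

userType  userInstanceID  userID  numLogins  firstLogin  lastLogin
customer  1               1       2          2020-08-17  2020-09-17
customer  2               1       2          2020-07-23  2020-10-25
employee  1               2       1          2020-09-18  2020-09-18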
You'll notice this is shorter and uses fewer features, which is usually a sign you've got the query into a form the optimizer is most likely to handle well.
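Since you mention millions of rows, it's also worth indexing the grouping columns so the engine can satisfy the query with a streaming aggregate over an ordered scan rather than a hash or sort. A sketch (the index name here is made up; adjust to your own conventions):

CREATE INDEX IX_userActivity_user
    ON dbo.userActivity (userType, userInstanceID, userID)
    INCLUDE (login);

Adding login as an included column makes the index covering for this particular query, so the base table need not be touched at all.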
¹ Taken together, these mean that you're going to potentially compute exactly the same result rows multiple times. Whether you actually do depends on a) whether you have multiple rows with the same unique combinations and b) how smart the optimizer is being.
² I'd suggest strongly that any time you're contemplating DISTINCT you ought to be thinking GROUP BY. DISTINCT is effectively GROUP BY *³, but it's rare to not have aggregates when grouping.
³ Except the * being "all columns in the SELECT clause" rather than "all columns produced by the FROM/JOINs".