I've broken my head against this issue trying to reproduce it on a dynamically generated data set, but it finally worked out!
This is the code
;with tbl as
(
select Id, ClCode, Manager, ChangeDate
from (values
(1, '000005', 'Cierra Vega', '2017-10-05'),
(2, '000005', 'Alden Cantrell', '2017-11-29'),
(3, '000005', 'Alden Cantrell', '2017-11-30'),
(4, '000005', 'Kierra Gentry', '2018-09-05'),
(5, '000005', 'Kierra Gentry', '2018-09-12'),
(6, '000005', 'Pierre Cox', '2018-11-06'),
(7, '000005', 'Thomas Crane', '2019-09-11'),
(8, '000005', 'Thomas Crane', '2019-10-01'),
(9, '000005', 'Miranda Shaffer', '2020-04-27'),
(10,'000360', 'Bradyn Kramer', '2017-10-06')
) as t(Id, ClCode, Manager, ChangeDate)
)
, grouped as
(
select c.ClCode
, count(distinct c.Manager) [ManagerChangesCount]
, STRING_AGG(c.[Manager], ',') within group (order by c.MinChangeDate) [Managers]
, STRING_AGG(c.MinChangeDate, ',') within group (order by c.MinChangeDate) [ChangeDates]
from (
select x.ClCode
, x.[Manager]
, min(x.ChangeDate) [MinChangeDate]
from tbl x
group by x.ClCode, x.[Manager]
) c
group by c.ClCode
)
select *
from grouped
My sample dataset contains data about when (ChangeDate) certain clients (ClCode) had their manager (Manager) changed. It is a part of the real DWH Clients dimension table (SCD Type 2), so these "duplicates" just contain changes in the other columns somewhere.
What I'm trying to achieve: I need a list of client codes with how many times their manager was changed and a comma separated list of these manager names sorted from left to right in their change order
ClCode ManagerChangesCount Managers ChangeDates
000005 6 Cierra Vega,Alden Cantrell,Kierra Gentry,Pierre Cox,Thomas Crane,Miranda Shaffer 2017-10-05,2017-11-29,2018-09-05,2018-11-06,2019-09-11,2020-04-27
000360 1 Bradyn Kramer 2017-10-06
But in fact I receive results without or with some weird sorting
ClCode ManagerChangesCount Managers ChangeDates
000005 6 Alden Cantrell,Cierra Vega,Kierra Gentry,Miranda Shaffer,Pierre Cox,Thomas Crane 2017-11-29,2017-10-05,2018-09-05,2020-04-27,2018-11-06,2019-09-11
000360 1 Bradyn Kramer 2017-10-06
This query returns good sorting:
, count(distinct c.Manager) [ManagerChangesCount]
where ClCode = '000005'
But all the same for me it looks like a bug... I mean unsorted results from my query
Guys, if you can figure this out, please help me to understand why sorting is not working
Firstly, I do agree that the behaviour you're getting shouldn't be happening, however, Stack Overflow isn't for reporting bugs with applications. For SQL Server, that should be done in their Azure Feedback portal.
As for resolving the issue, removing the redundant DISTINCT
from your COUNT
causes the problem to disappear. To implement a DISTINCT
(either in a SELECT DISTINCT
or a COUNT(DISTINCT {expression})
) SQL Server needs to first sort the results as then it can easily remove any values that have the same sort position. As a result that sort is being expressed in your STRING_AGG
expressions, even though they have an explicit ORDER BY
clause.
The reason I say your DISTINCT
is redundant is because at that point in the query there will be no duplicate values of Manager
for a given value of ClCode
. This is because you already grouped on both Manager
and ClCode
in the subquery. If you run that query alone, you'll see that Manager
doesn't have any duplicates:
WITH tbl AS
(SELECT Id,
ClCode,
Manager,
ChangeDate
FROM (VALUES (1, '000005', 'Cierra Vega', '2017-10-05'),
(2, '000005', 'Alden Cantrell', '2017-11-29'),
(3, '000005', 'Alden Cantrell', '2017-11-30'),
(4, '000005', 'Kierra Gentry', '2018-09-05'),
(5, '000005', 'Kierra Gentry', '2018-09-12'),
(6, '000005', 'Pierre Cox', '2018-11-06'),
(7, '000005', 'Thomas Crane', '2019-09-11'),
(8, '000005', 'Thomas Crane', '2019-10-01'),
(9, '000005', 'Miranda Shaffer', '2020-04-27'),
(10, '000360', 'Bradyn Kramer', '2017-10-06')) t (Id, ClCode, Manager, ChangeDate) )
SELECT x.ClCode,
x.[Manager],
MIN(x.ChangeDate) AS [MinChangeDate]
FROM tbl x
GROUP BY x.ClCode,
x.[Manager];
As such, the DISTINCT
in the COUNT
is just added overhead for the instance, as it's not required (SQL Server has already sorted the data for the GROUP BY
so why ask it to sort it again?). If you Are using a DISTINCT
in a query you've already aggregated, then you very likely don't need it.