Search code examples
sql-serversql-server-2017string-agg

SQL Server STRING_AGG function sorting is not working as expected


I've broken my head against this issue trying to reproduce it on a dynamically generated data set, but it finally worked out!

This is the code

;with tbl as
(
    select Id,  ClCode, Manager,    ChangeDate
    from (values
    (1, '000005',   'Cierra Vega',      '2017-10-05'),
    (2, '000005',   'Alden Cantrell',   '2017-11-29'),
    (3, '000005',   'Alden Cantrell',   '2017-11-30'),
    (4, '000005',   'Kierra Gentry',    '2018-09-05'),
    (5, '000005',   'Kierra Gentry',    '2018-09-12'),
    (6, '000005',   'Pierre Cox',       '2018-11-06'),
    (7, '000005',   'Thomas Crane',     '2019-09-11'),
    (8, '000005',   'Thomas Crane',     '2019-10-01'),
    (9, '000005',   'Miranda Shaffer',  '2020-04-27'),
    (10,'000360',   'Bradyn Kramer',    '2017-10-06')
    ) as t(Id, ClCode, Manager, ChangeDate)
)
, grouped as                        
(                       
    select c.ClCode
    , count(distinct c.Manager) [ManagerChangesCount]
    , STRING_AGG(c.[Manager], ',') within group (order by c.MinChangeDate) [Managers]
    , STRING_AGG(c.MinChangeDate, ',') within group (order by c.MinChangeDate) [ChangeDates]
    from (                  
        select x.ClCode
        , x.[Manager]
        , min(x.ChangeDate) [MinChangeDate] 
        from tbl x
        group by x.ClCode, x.[Manager]  
    ) c                 
    group by c.ClCode                   
)
select *
from grouped

My sample dataset contains data about when (ChangeDate) certain clients (ClCode) had their manager (Manager) changed. It is a part of the real DWH Clients dimension table (SCD Type 2), so these "duplicates" just contain changes in the other columns somewhere.

What I'm trying to achieve: I need a list of client codes with how many times their manager was changed and a comma separated list of these manager names sorted from left to right in their change order

ClCode  ManagerChangesCount             Managers                                                                            ChangeDates
000005  6                               Cierra Vega,Alden Cantrell,Kierra Gentry,Pierre Cox,Thomas Crane,Miranda Shaffer    2017-10-05,2017-11-29,2018-09-05,2018-11-06,2019-09-11,2020-04-27
000360  1                               Bradyn Kramer                                                                       2017-10-06

But in fact I receive results without or with some weird sorting

ClCode  ManagerChangesCount Managers                                                                                        ChangeDates
000005  6                               Alden Cantrell,Cierra Vega,Kierra Gentry,Miranda Shaffer,Pierre Cox,Thomas Crane    2017-11-29,2017-10-05,2018-09-05,2020-04-27,2018-11-06,2019-09-11
000360  1                               Bradyn Kramer                                                                       2017-10-06

This query returns good sorting:

  1. if I remove , count(distinct c.Manager) [ManagerChangesCount]
  2. if I remove string id=10 (second client)
  3. if I add a filtering condition where where ClCode = '000005'
  4. if I get rid of duplicate rows in a dataset and remove group by from a subquery

But all the same for me it looks like a bug... I mean unsorted results from my query

Guys, if you can figure this out, please help me to understand why sorting is not working


Solution

  • Firstly, I do agree that the behaviour you're getting shouldn't be happening, however, Stack Overflow isn't for reporting bugs with applications. For SQL Server, that should be done in their Azure Feedback portal.

    As for resolving the issue, removing the redundant DISTINCT from your COUNT causes the problem to disappear. To implement a DISTINCT (either in a SELECT DISTINCT or a COUNT(DISTINCT {expression})) SQL Server needs to first sort the results as then it can easily remove any values that have the same sort position. As a result that sort is being expressed in your STRING_AGG expressions, even though they have an explicit ORDER BY clause.

    The reason I say your DISTINCT is redundant is because at that point in the query there will be no duplicate values of Manager for a given value of ClCode. This is because you already grouped on both Manager and ClCode in the subquery. If you run that query alone, you'll see that Manager doesn't have any duplicates:

    WITH tbl AS
        (SELECT Id,
                ClCode,
                Manager,
                ChangeDate
         FROM (VALUES (1, '000005', 'Cierra Vega', '2017-10-05'),
                      (2, '000005', 'Alden Cantrell', '2017-11-29'),
                      (3, '000005', 'Alden Cantrell', '2017-11-30'),
                      (4, '000005', 'Kierra Gentry', '2018-09-05'),
                      (5, '000005', 'Kierra Gentry', '2018-09-12'),
                      (6, '000005', 'Pierre Cox', '2018-11-06'),
                      (7, '000005', 'Thomas Crane', '2019-09-11'),
                      (8, '000005', 'Thomas Crane', '2019-10-01'),
                      (9, '000005', 'Miranda Shaffer', '2020-04-27'),
                      (10, '000360', 'Bradyn Kramer', '2017-10-06')) t (Id, ClCode, Manager, ChangeDate) )
    SELECT x.ClCode,
           x.[Manager],
           MIN(x.ChangeDate) AS [MinChangeDate]
    FROM tbl x
    GROUP BY x.ClCode,
             x.[Manager];
    

    As such, the DISTINCT in the COUNT is just added overhead for the instance, as it's not required (SQL Server has already sorted the data for the GROUP BY so why ask it to sort it again?). If you Are using a DISTINCT in a query you've already aggregated, then you very likely don't need it.