Search code examples
sqlsql-serveraggregate-functionsstring-concatenation

Optimal way to concatenate/aggregate strings


I'm finding a way to aggregate strings from different rows into a single row. I'm looking to do this in many different places, so having a function to facilitate this would be nice. I've tried solutions using COALESCE and FOR XML, but they just don't cut it for me.

String aggregation would do something like this:

id | Name                    Result: id | Names
-- - ----                            -- - -----
1  | Matt                            1  | Matt, Rocks
1  | Rocks                           2  | Stylus
2  | Stylus

I've taken a look at CLR-defined aggregate functions as a replacement for COALESCE and FOR XML, but apparently SQL Azure does not support CLR-defined stuff, which is a pain for me because I know being able to use it would solve a whole lot of problems for me.

Is there any possible workaround, or similarly optimal method (which might not be as optimal as CLR, but hey I'll take what I can get) that I can use to aggregate my stuff?


Solution

  • SOLUTION

    The definition of optimal can vary, but here's how to concatenate strings from different rows using regular Transact SQL, which should work fine in Azure.

    ;WITH Partitioned AS
    (
        SELECT 
            ID,
            Name,
            ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Name) AS NameNumber,
            COUNT(*) OVER (PARTITION BY ID) AS NameCount
        FROM dbo.SourceTable
    ),
    Concatenated AS
    (
        SELECT 
            ID, 
            CAST(Name AS nvarchar) AS FullName, 
            Name, 
            NameNumber, 
            NameCount 
        FROM Partitioned 
        WHERE NameNumber = 1
    
        UNION ALL
    
        SELECT 
            P.ID, 
            CAST(C.FullName + ', ' + P.Name AS nvarchar), 
            P.Name, 
            P.NameNumber, 
            P.NameCount
        FROM Partitioned AS P
            INNER JOIN Concatenated AS C 
                    ON P.ID = C.ID 
                    AND P.NameNumber = C.NameNumber + 1
    )
    SELECT 
        ID,
        FullName
    FROM Concatenated
    WHERE NameNumber = NameCount
    

    EXPLANATION

    The approach boils down to three steps:

    1. Number the rows using OVER and PARTITION grouping and ordering them as needed for the concatenation. The result is Partitioned CTE. We keep counts of rows in each partition to filter the results later.

    2. Using recursive CTE (Concatenated) iterate through the row numbers (NameNumber column) adding Name values to FullName column.

    3. Filter out all results but the ones with the highest NameNumber.

    Please keep in mind that in order to make this query predictable one has to define both grouping (for example, in your scenario rows with the same ID are concatenated) and sorting (I assumed that you simply sort the string alphabetically before concatenation).

    I've quickly tested the solution on SQL Server 2012 with the following data:

    INSERT dbo.SourceTable (ID, Name)
    VALUES 
    (1, 'Matt'),
    (1, 'Rocks'),
    (2, 'Stylus'),
    (3, 'Foo'),
    (3, 'Bar'),
    (3, 'Baz')
    

    The query result:

    ID          FullName
    ----------- ------------------------------
    2           Stylus
    3           Bar, Baz, Foo
    1           Matt, Rocks