Search code examples
sqlsql-servergroup-byaggregate-functions

Multiple STRING_AGG on multiple join columns causes bloated aggregation


I've got a table in my MSSQL server, lets call it blogPost. I've also got two tag tables, lets call them fooTag and barTag. The tag tables are used to tag the blogPost table which are identically structured.

blogPost

| postId | title               |        body |
+--------+---------------------+-------------+
| 1      | The life on a query | lorem ipsum |
+--------+---------------------+-------------+

fooTag and barTag

| postId | tagName      |
+--------+--------------+
| 1      | sql          |
| 1      | query        |
| 1      | select-query |
+--------+--------------+

I want to get a single blogpost along with all it's tags in a single row so then STRING_AGG() feels suitable to make a query like this:

SELECT blogPost.*, STRING_AGG(fooTag.tagName, ';') as [fooTags], STRING_AGG(barTag.tagName, ';') as [barTags]
FROM blogPost
LEFT JOIN fooTag ON blogPost.postId = fooTag.postId
LEFT JOIN barTag ON blogPost.postId = barTag.postId
WHERE postId = 1
GROUP BY blogPost.postId, title, body

When making this query I'd expect to get the result

| postId | title               |        body | fooTags                 | barTags                 |
+--------+---------------------+-------------+-------------------------+-------------------------+
| 1      | The life on a query | lorem ipsum | sql;query;select-query | sql;query;select-query |
+--------+---------------------+-------------+-------------------------+-------------------------+

But I'm getting this result instead where bar tags (i.e. the last STRING_AGG selected) are duplicated.

| postId | title               |        body | fooTags                 | barTags                                       |
+--------+---------------------+-------------+-------------------------+-----------------------------------------------+
| 1      | The life on a query | lorem ipsum | sql;query;select-query; | sql;sql;sql;query;query;query;select-query;select-query;select-query |
+--------+---------------------+-------------+-------------------------+-----------------------------------------------+

Putting barTags last in the SELECT statement makes it so that barTags gets the duplicates instead of fooTags. The amount of duplicates created seem to be bound to the amount of rows columns being aggregated together in the first STRING_AGG result column, so if fooTags has 5 rows to aggregate together there will be 5 duplicates of each barTag in the barTags column in the result.

How would I get the result I want without duplicates?


Solution

  • Your problem is caused by each row in fooTags creating that many rows of barTags in the JOIN, hence the duplication. You can work around this issue by performing the STRING_AGG in the footags and bartags tables before JOINing them:

    SELECT blogPost.*, f.tags as [fooTags], b.tags as [barTags]
    FROM blogPost
    LEFT JOIN (SELECT postId, STRING_AGG(tagName, ';') AS tags
               FROM fooTag
               GROUP BY postId) f ON blogPost.postId = f.postId
    LEFT JOIN (SELECT postId, STRING_AGG(tagName, ';') AS tags
               FROM barTag
               GROUP BY postId) b ON blogPost.postId = b.postId
    WHERE postId = 1