Search code examples
sqlsql-servergroup-bymany-to-manyconcatenation

Retrieve data from SQL Server and concatenate results over rows based on grouping


I’ve been working on a problem for a couple of days and have finally managed to work out a solution that works for me. In case this solution is useful for someone else, I’m going to ask a question and answer it myself.

I have read-only access to a large SQL Server database containing in excess of 1 million records. Some of the tables in the database are linked in many-to-many relationships through lookup tables. To simplify matters, the tables can be illustrated as shown below:

table names
|-----------|
| id | name |
|----|------|
|  1 | dave |
|  2 | phil |
|  3 | john |       table foods_relationship        table clothes_relationship
|  4 | pete |       |--------------------------|    |----------------------------|
|-----------|       | id | names_id | foods_id |    | id | names_id | clothes_id |
                    |----|----------|----------|    |----|----------|------------|
table foods         |  1 |        1 |        1 |    |  1 |        1 |          1 |
|---------------|   |  2 |        1 |        3 |    |  2 |        1 |          3 |
| id | food     |   |  3 |        1 |        4 |    |  3 |        1 |          4 |
|----|----------|   |  4 |        2 |        2 |    |  4 |        2 |          2 |
|  1 | beef     |   |  5 |        2 |        3 |    |  5 |        2 |          3 |
|  2 | tomatoes |   |  6 |        2 |        4 |    |  6 |        2 |          4 |
|  3 | bacon    |   |  7 |        2 |        5 |    |  7 |        3 |          1 |
|  4 | cheese   |   |  8 |        3 |        3 |    |  8 |        3 |          3 |
|  5 | apples   |   |  9 |        3 |        5 |    |  9 |        3 |          5 |
|---------------|   | 10 |        4 |        1 |    | 10 |        4 |          2 |
                    | 11 |        4 |        2 |    | 11 |        4 |          4 |
table clothes       | 12 |        4 |        3 |    | 12 |        4 |          5 |
|---------------|   | 13 |        4 |        5 |    |----------------------------|
| id | clothes  |   |--------------------------|
|----|----------|
|  1 | trousers |
|  2 | shorts   |
|  3 | shirt    |
|  4 | socks    |
|  5 | jumper   |
|  6 | jacket   |
|---------------|

The tables can be recreated using the following SQL (adapted from MySQL database so may need minor tweaking to work in SQL Server):

CREATE TABLE `clothes` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `clothes` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO `clothes` (`id`, `clothes`)
VALUES
    (1,'trousers'),
    (2,'shorts'),
    (3,'shirt'),
    (4,'socks'),
    (5,'jumper'),
    (6,'jacket');

CREATE TABLE `clothes_relationships` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `names_id` int(11) DEFAULT NULL,
  `clothes_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO `clothes_relationships` (`id`, `names_id`, `clothes_id`)
VALUES
    (1,1,1),
    (2,1,3),
    (3,1,4),
    (4,2,2),
    (5,2,3),
    (6,2,4),
    (7,3,1),
    (8,3,3),
    (9,3,5),
    (10,4,2),
    (11,4,4),
    (12,4,5);

CREATE TABLE `food_relationships` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `names_id` int(11) DEFAULT NULL,
  `foods_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO `food_relationships` (`id`, `names_id`, `foods_id`)
VALUES
    (1,1,1),
    (2,1,3),
    (3,1,4),
    (4,2,2),
    (5,2,3),
    (6,2,4),
    (7,2,5),
    (8,3,3),
    (9,3,5),
    (10,4,1),
    (11,4,2),
    (12,4,3),
    (13,4,5);

CREATE TABLE `foods` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `food` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO `foods` (`id`, `food`)
VALUES
    (1,'beef'),
    (2,'tomatoes'),
    (3,'bacon'),
    (4,'cheese'),
    (5,'apples');

CREATE TABLE `names` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO `names` (`id`, `name`)
VALUES
    (1,'dave'),
    (2,'phil'),
    (3,'john'),
    (4,'pete');

I want to query the database and – somehow – get the following output:

|-------------------------------------------------------------|
| name | food                         | clothes               |
|------|------------------------------|-----------------------|
| dave | beef,cheese,bacon            | trousers,socks,shirt  |
| john | apples,bacon                 | jumper,shirt,trousers |
| pete | beef,apples,bacon,tomatoes   | shorts,jumper,socks   |
| phil | bacon,tomatoes,apples,cheese | shirt,shorts,socks    |
|-------------------------------------------------------------|

However, running a SELECT query that joins the ‘names’ table to one or both of the other tables (via the respective lookup tables) results in multiple rows for each name. For example:

SELECT
    names.name,
    foods.food

FROM
    names
    LEFT JOIN food_relationships ON names.id = food_relationships.names_id
    LEFT JOIN foods ON food_relationships.foods_id = foods.id;

...produces the following set of results:

|-----------------|
| name | food     |
|------|----------|
| dave | beef     |
| dave | bacon    |
| dave | cheese   |
| phil | tomatoes |
| phil | bacon    |
| phil | cheese   |
| phil | apples   |
| john | bacon    |
| john | apples   |
| pete | beef     |
| pete | tomatoes |
| pete | bacon    |
| pete | apples   |
|-----------------|

The problem is compounded even further if the SELECT query returns data from both tables:

SELECT
    names.name,
    foods.food,
    clothes.clothes

FROM
    names
    LEFT JOIN food_relationships ON names.id = food_relationships.names_id
    LEFT JOIN foods ON food_relationships.foods_id = foods.id
    LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
    LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id;

|-----------------------------|
| name | food     | clothes   |
|------|----------|-----------|
| dave | beef     | trousers  |
| dave | beef     | shirt     |
| dave | beef     | socks     |
| dave | bacon    | trousers  |
| dave | bacon    | shirt     |
| dave | bacon    | socks     |
| dave | cheese   | trousers  |
| dave | cheese   | shirt     |
| dave | cheese   | socks     |
| phil | tomatoes | shorts    |
| phil | tomatoes | shirt     |
| phil | tomatoes | socks     |
| phil | bacon    | shorts    |
| phil | bacon    | shirt     |
| phil | bacon    | socks     |
| phil | cheese   | shorts    |
| phil | cheese   | shirt     |
| phil | cheese   | socks     |
| phil | apples   | shorts    |
| phil | apples   | shirt     |
| phil | apples   | socks     |
| ...
| etc.

The question is, how can I query the SQL Server database to retrieve all the data but process it to have only one line per person?


Solution

  • If the database were MySQL, the solution would be relatively easy because MySQL has a GROUP_CONCAT function which concatenates rows. So, for just one of the tables, I could use:

    SELECT
        names.name,
        GROUP_CONCAT(foods.food)
    
    FROM
        names
        LEFT JOIN food_relationships ON names.id = food_relationships.names_id
        LEFT JOIN foods ON food_relationships.foods_id = foods.id
    
    GROUP BY (names.name);
    

    ...to give:

    name    food
    dave    beef,cheese,bacon
    john    apples,bacon
    pete    beef,apples,bacon,tomatoes
    phil    bacon,tomatoes,apples,cheese
    

    To get equivalent data from both ‘names’ and ‘clothes’ tables, I could use something like:

    SELECT
        temp_foods_table.name               AS 'name',
        temp_foods_table.food               AS 'food',
        temp_clothes_table.clothes          AS 'clothes'
    
    FROM
    (
        SELECT
            names.name,
            GROUP_CONCAT(foods.food)        AS 'food'
    
        FROM
            names
            LEFT JOIN food_relationships ON names.id = food_relationships.names_id
            LEFT JOIN foods ON food_relationships.foods_id = foods.id
    
        GROUP BY (names.name)
    
    ) AS temp_foods_table
    
    LEFT JOIN
    
    (
        SELECT
            names.name,
            GROUP_CONCAT(clothes.clothes)    AS 'clothes'
    
        FROM
            names
            LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
            LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id
    
        GROUP BY (names.name)
    
    ) AS temp_clothes_table
    
    ON temp_foods_table.name = temp_clothes_table.name;
    

    ...to give the following results:

    name    food                            clothes
    dave    beef,cheese,bacon               trousers,socks,shirt
    john    apples,bacon                    jumper,shirt,trousers
    pete    beef,apples,bacon,tomatoes      shorts,jumper,socks
    phil    bacon,tomatoes,apples,cheese    shirt,shorts,socks
    

    However, the situation in SQL SERVER appears to be much less straight-forward. For a single table there are some proposed solutions online which include the use of common table expressions or FOR XML PATH. However, all the solutions appear to have drawbacks and give the distinct impression that they are work-arounds rather than specifically-designed features. Each suggested solution has some weakness (for example, the FOR XML PATH solution assumes that the text is XML and, therefore, special characters included in the text can cause problems). In addition, some commenters expressed concerns that such work-arounds were based on undocumented or deprecated features and, as a result, may not be reliable in the long-term.

    As a result, I decided not to tie myself up in SQL knots but to process the data post-retrieval using Python and Pandas. I always transfer data to a Pandas dataframe for plotting and analysis anyway so this wasn't a major inconvenience. In order to concatenate the data over several columns, I used groupby(). However, since there were two many-to-many tables, there was duplication in each column and, therefore, the final concatenated string contained all those duplications. In order to have only unique values, I used Python sets (which, by definition, can only contain unique values). The only potential drawback with this method is that the order of the strings is not maintained but, for my situation, that isn't an issue. The final Python solution looked something like this:

    Import necessary libraries:

    >>> import pandas as pd
    >>> import pymssql
    >>> import getpass
    

    Input necessary details to connect to database:

    >>> myServer = input("Enter server address: ")
    >>> myUser = input("Enter username: ")
    >>> myPwd = getpass.getpass("Enter password: ")
    

    Create a connection:

    >>> myConnection = pymssql.connect(server=myServer, user=myUser, password=myPwd, port='1433')
    

    Define a query to retrieve the necessary data:

    >>> myQuery = """SELECT
                             names.name,
                             foods.food,
                             clothes.clothes
    
                         FROM
                             names
                             LEFT JOIN food_relationships ON names.id = food_relationships.names_id
                             LEFT JOIN foods ON food_relationships.foods_id = foods.id
                             LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
                             LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id """
    

    Run query, put results in dataframe and close connection:

    >>> myLatestData = pd.io.sql.read_sql(myQuery, con=myConnection)
    >>> myConnection.close()
    

    Concatenate strings in multiple rows and remove duplicates:

    >>> tempDF = tempDF.groupby('name').agg(lambda col: ','.join(set(col)))
    

    Print final dataframe:

    >>> print(tempDF)
    
    name                          food                clothes
    dave             beef,bacon,cheese   socks,trousers,shirt
    john                  bacon,apples  jumper,trousers,shirt
    pete    tomatoes,beef,bacon,apples    socks,jumper,shorts
    phil  tomatoes,bacon,cheese,apples     socks,shorts,shirt
    

    For me, this solution makes much more intuitive sense than trying to do all the data processing within the SQL query. Hope this helps someone else.