I’ve been working on a problem for a couple of days and have finally managed to work out a solution that works for me. In case this solution is useful for someone else, I’m going to ask a question and answer it myself.
I have read-only access to a large SQL Server database containing in excess of 1 million records. Some of the tables in the database are linked in many-to-many relationships through lookup tables. To simplify matters, the tables can be illustrated as shown below:
table names
|-----------|
| id | name |
|----|------|
| 1 | dave |
| 2 | phil |
| 3 | john | table foods_relationship table clothes_relationship
| 4 | pete | |--------------------------| |----------------------------|
|-----------| | id | names_id | foods_id | | id | names_id | clothes_id |
|----|----------|----------| |----|----------|------------|
table foods | 1 | 1 | 1 | | 1 | 1 | 1 |
|---------------| | 2 | 1 | 3 | | 2 | 1 | 3 |
| id | food | | 3 | 1 | 4 | | 3 | 1 | 4 |
|----|----------| | 4 | 2 | 2 | | 4 | 2 | 2 |
| 1 | beef | | 5 | 2 | 3 | | 5 | 2 | 3 |
| 2 | tomatoes | | 6 | 2 | 4 | | 6 | 2 | 4 |
| 3 | bacon | | 7 | 2 | 5 | | 7 | 3 | 1 |
| 4 | cheese | | 8 | 3 | 3 | | 8 | 3 | 3 |
| 5 | apples | | 9 | 3 | 5 | | 9 | 3 | 5 |
|---------------| | 10 | 4 | 1 | | 10 | 4 | 2 |
| 11 | 4 | 2 | | 11 | 4 | 4 |
table clothes | 12 | 4 | 3 | | 12 | 4 | 5 |
|---------------| | 13 | 4 | 5 | |----------------------------|
| id | clothes | |--------------------------|
|----|----------|
| 1 | trousers |
| 2 | shorts |
| 3 | shirt |
| 4 | socks |
| 5 | jumper |
| 6 | jacket |
|---------------|
The tables can be recreated using the following SQL (adapted from MySQL database so may need minor tweaking to work in SQL Server):
CREATE TABLE `clothes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`clothes` varchar(32) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `clothes` (`id`, `clothes`)
VALUES
(1,'trousers'),
(2,'shorts'),
(3,'shirt'),
(4,'socks'),
(5,'jumper'),
(6,'jacket');
CREATE TABLE `clothes_relationships` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`names_id` int(11) DEFAULT NULL,
`clothes_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `clothes_relationships` (`id`, `names_id`, `clothes_id`)
VALUES
(1,1,1),
(2,1,3),
(3,1,4),
(4,2,2),
(5,2,3),
(6,2,4),
(7,3,1),
(8,3,3),
(9,3,5),
(10,4,2),
(11,4,4),
(12,4,5);
CREATE TABLE `food_relationships` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`names_id` int(11) DEFAULT NULL,
`foods_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `food_relationships` (`id`, `names_id`, `foods_id`)
VALUES
(1,1,1),
(2,1,3),
(3,1,4),
(4,2,2),
(5,2,3),
(6,2,4),
(7,2,5),
(8,3,3),
(9,3,5),
(10,4,1),
(11,4,2),
(12,4,3),
(13,4,5);
CREATE TABLE `foods` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`food` varchar(32) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `foods` (`id`, `food`)
VALUES
(1,'beef'),
(2,'tomatoes'),
(3,'bacon'),
(4,'cheese'),
(5,'apples');
CREATE TABLE `names` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(32) DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `names` (`id`, `name`)
VALUES
(1,'dave'),
(2,'phil'),
(3,'john'),
(4,'pete');
I want to query the database and – somehow – get the following output:
|-------------------------------------------------------------|
| name | food | clothes |
|------|------------------------------|-----------------------|
| dave | beef,cheese,bacon | trousers,socks,shirt |
| john | apples,bacon | jumper,shirt,trousers |
| pete | beef,apples,bacon,tomatoes | shorts,jumper,socks |
| phil | bacon,tomatoes,apples,cheese | shirt,shorts,socks |
|-------------------------------------------------------------|
However, running a SELECT query that joins the ‘names’ table to one or both of the other tables (via the respective lookup tables) results in multiple rows for each name. For example:
SELECT
names.name,
foods.food
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id;
...produces the following set of results:
|-----------------|
| name | food |
|------|----------|
| dave | beef |
| dave | bacon |
| dave | cheese |
| phil | tomatoes |
| phil | bacon |
| phil | cheese |
| phil | apples |
| john | bacon |
| john | apples |
| pete | beef |
| pete | tomatoes |
| pete | bacon |
| pete | apples |
|-----------------|
The problem is compounded even further if the SELECT query returns data from both tables:
SELECT
names.name,
foods.food,
clothes.clothes
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id;
|-----------------------------|
| name | food | clothes |
|------|----------|-----------|
| dave | beef | trousers |
| dave | beef | shirt |
| dave | beef | socks |
| dave | bacon | trousers |
| dave | bacon | shirt |
| dave | bacon | socks |
| dave | cheese | trousers |
| dave | cheese | shirt |
| dave | cheese | socks |
| phil | tomatoes | shorts |
| phil | tomatoes | shirt |
| phil | tomatoes | socks |
| phil | bacon | shorts |
| phil | bacon | shirt |
| phil | bacon | socks |
| phil | cheese | shorts |
| phil | cheese | shirt |
| phil | cheese | socks |
| phil | apples | shorts |
| phil | apples | shirt |
| phil | apples | socks |
| ...
| etc.
The question is, how can I query the SQL Server database to retrieve all the data but process it to have only one line per person?
If the database were MySQL, the solution would be relatively easy because MySQL has a GROUP_CONCAT function which concatenates rows. So, for just one of the tables, I could use:
SELECT
names.name,
GROUP_CONCAT(foods.food)
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
GROUP BY (names.name);
...to give:
name food
dave beef,cheese,bacon
john apples,bacon
pete beef,apples,bacon,tomatoes
phil bacon,tomatoes,apples,cheese
To get equivalent data from both ‘names’ and ‘clothes’ tables, I could use something like:
SELECT
temp_foods_table.name AS 'name',
temp_foods_table.food AS 'food',
temp_clothes_table.clothes AS 'clothes'
FROM
(
SELECT
names.name,
GROUP_CONCAT(foods.food) AS 'food'
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
GROUP BY (names.name)
) AS temp_foods_table
LEFT JOIN
(
SELECT
names.name,
GROUP_CONCAT(clothes.clothes) AS 'clothes'
FROM
names
LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id
GROUP BY (names.name)
) AS temp_clothes_table
ON temp_foods_table.name = temp_clothes_table.name;
...to give the following results:
name food clothes
dave beef,cheese,bacon trousers,socks,shirt
john apples,bacon jumper,shirt,trousers
pete beef,apples,bacon,tomatoes shorts,jumper,socks
phil bacon,tomatoes,apples,cheese shirt,shorts,socks
However, the situation in SQL SERVER appears to be much less straight-forward. For a single table there are some proposed solutions online which include the use of common table expressions or FOR XML PATH. However, all the solutions appear to have drawbacks and give the distinct impression that they are work-arounds rather than specifically-designed features. Each suggested solution has some weakness (for example, the FOR XML PATH solution assumes that the text is XML and, therefore, special characters included in the text can cause problems). In addition, some commenters expressed concerns that such work-arounds were based on undocumented or deprecated features and, as a result, may not be reliable in the long-term.
As a result, I decided not to tie myself up in SQL knots but to process the data post-retrieval using Python and Pandas. I always transfer data to a Pandas dataframe for plotting and analysis anyway so this wasn't a major inconvenience. In order to concatenate the data over several columns, I used groupby(). However, since there were two many-to-many tables, there was duplication in each column and, therefore, the final concatenated string contained all those duplications. In order to have only unique values, I used Python sets (which, by definition, can only contain unique values). The only potential drawback with this method is that the order of the strings is not maintained but, for my situation, that isn't an issue. The final Python solution looked something like this:
Import necessary libraries:
>>> import pandas as pd
>>> import pymssql
>>> import getpass
Input necessary details to connect to database:
>>> myServer = input("Enter server address: ")
>>> myUser = input("Enter username: ")
>>> myPwd = getpass.getpass("Enter password: ")
Create a connection:
>>> myConnection = pymssql.connect(server=myServer, user=myUser, password=myPwd, port='1433')
Define a query to retrieve the necessary data:
>>> myQuery = """SELECT
names.name,
foods.food,
clothes.clothes
FROM
names
LEFT JOIN food_relationships ON names.id = food_relationships.names_id
LEFT JOIN foods ON food_relationships.foods_id = foods.id
LEFT JOIN clothes_relationships ON names.id = clothes_relationships.names_id
LEFT JOIN clothes ON clothes_relationships.clothes_id = clothes.id """
Run query, put results in dataframe and close connection:
>>> myLatestData = pd.io.sql.read_sql(myQuery, con=myConnection)
>>> myConnection.close()
Concatenate strings in multiple rows and remove duplicates:
>>> tempDF = tempDF.groupby('name').agg(lambda col: ','.join(set(col)))
Print final dataframe:
>>> print(tempDF)
name food clothes
dave beef,bacon,cheese socks,trousers,shirt
john bacon,apples jumper,trousers,shirt
pete tomatoes,beef,bacon,apples socks,jumper,shorts
phil tomatoes,bacon,cheese,apples socks,shorts,shirt
For me, this solution makes much more intuitive sense than trying to do all the data processing within the SQL query. Hope this helps someone else.