I have an interesting problem with partitioning a table into groups. I have a group of Tourists - each speak a single language and/or are part of a family. I need to partition the table into groups, but I want to keep families and similar language speakers together.
Let's say I want to split the Tourists into groups of up to 3 (if a group must be larger, that's acceptable). The solution doesn't have to be so smart that it fills up all groups entirely, but I'm doing a best-effort approach.
Input:
TouristID | LanguageID | FamilyID
---------------------------------
1 | 1 | 1
2 | 1 | 1
3 | 1 | 1
4 | 2 | 1
5 | 3 | 2
6 | 4 | 2
7 | 5 | 3
8 | 5 | 4
9 | 7 | 5
Desired result:
TouristID | GroupID
-------------------
1 | 1
2 | 1
3 | 1
4 | 1
5 | 2
6 | 2
7 | 3
8 | 3
9 | 2
Group 1 is formed by all Language 1 speakers, including the one family member that can't be left out.
Group 2 is formed by the two Family members (5, 6) and one random member (9) to make the group of 3.
Group 3 is formed by the two same language speakers (7, 8)
What I've done:
INSERT TouristGroup
SELECT
t.TouristID,
DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
FROM Tourists t
CROSS APPLY (
SELECT MIN(TouristID) AS [GroupID]
FROM Tourists t2
WHERE
( t2.LanguageID = t.LanguageID
OR t2.FamilyID = t.FamilyID )
) x;
INSERT Groups
SELECT GroupID, COUNT(*)
FROM TouristGroup
GROUP BY GroupID;
declare
@matchID int = 0,
@currentCount int,
@desiredCount int = 0,
@candidateGroupID int = null,
@chunk int = 1
while exists (
select null
from Groups g
left join Matches m
on m.GroupID = g.GroupID
where m.GroupID is null
)
begin
set @currentCount = null
set @candidateGroupID = null
select
@currentCount = isnull(SUM([Count]), 0)
from Matches m
join Groups g
on g.GroupID = m.GroupID
where m.MatchID = @matchID
if @CurrentCount is not null
begin
set @desiredCount = @chunk - @desiredCount
select top 1
@candidateGroupID = g.GroupID
from Groups g
left join Matches m
on m.GroupID = g.GroupID
where g.[Count] <= @desiredCount
and m.GroupID is null
order by [Count] DESC
if @candidateGroupID is not null
begin
insert Matches
select @matchID, @candidateGroupID
end
else begin
set @matchID = @matchID + 1
end
end
else begin
set @matchid = @matchID + 1
end
end
Question
Is there a better approach to partitioning a table, but grouping rows together based on multiple columns?
This will produce your "step 1". Maybe it's better than what you have now (no loop).
SELECT t.TouristID, DENSE_RANK() OVER (ORDER BY x.GroupNum) as GroupId
FROM Tourists t
CROSS APPLY (SELECT MIN(TouristId) AS GroupNum
FROM @Tourist t2
WHERE t2.LanguageId = t.LanguageId OR t2.FamilyId = t.FamilyId
) x
As for your other requirement of at least getting at least three group members, if possible, you'll probably have to do a loop similar to what you're doing (I'm not sure if it can be improved, since you haven't shared it).
[Update] Here's my proposal for "step 2":
DECLARE @MinGroupSize int = 3, @rc int = 1
WHILE @rc>0
BEGIN
WITH GroupCount AS (
SELECT GroupID, COUNT(*) AS GroupCount
FROM TouristGroup
GROUP BY GroupID
), CandidateGroups AS (
SELECT TOP 1 gc1.GroupID AS ShortGroupId, singleton.GroupID as SingletonGroupID
FROM GroupCount gc1
CROSS APPLY (SELECT TOP 1 GroupID
FROM GroupCount AS gc2
WHERE gc2.GroupCount = 1 AND gc2.GroupID != gc1.GroupID
ORDER BY gc2.GroupID
) AS singleton
WHERE gc1.GroupCount < @MinGroupSize
ORDER BY GroupCount DESC, gc1.GroupID ASC
)
UPDATE tg
SET GroupID = cg.ShortGroupID
FROM TouristGroup tg
JOIN CandidateGroups cg ON cg.SingletonGroupID = tg.GroupID;
SET @rc = @@ROWCOUNT;
END
--
-- If you're anal like me and want to eliminate gaps in GroupID values
--
UPDATE tg
SET GroupID = tg2.GroupID
FROM TouristGroup tg
JOIN (SELECT TouristID, DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
FROM TouristGroup) AS tg2 ON tg2.TouristID = tg.TouristID
WHERE tg.GroupID != tg2.GroupID;
This will find groups smaller than the desired minimum group size and find a singleton group (only 1 member) and update the singleton with the other GroupID, doing this one by one until there are no more candidates. The smaller groups are chosen in order (by GroupCount descending and then by GroupID ascending) so that larger groups will be filled first. Only singletons are chosen for updating so that no natural groups will be broken up.