sql sql-server sql-server-2008 t-sql sqlperformance

Partition table but group together based on multiple columns

I have an interesting problem with partitioning a table into groups. I have a group of Tourists - each speak a single language and/or are part of a family. I need to partition the table into groups, but I want to keep families and similar language speakers together.

Let's say I want to split the Tourists into groups of up to 3 (if a group must be larger, that's acceptable). The solution doesn't have to be so smart that it fills up all groups entirely, but I'm doing a best-effort approach.

Input:

TouristID | LanguageID | FamilyID
---------------------------------
    1     |     1      |    1
    2     |     1      |    1
    3     |     1      |    1
    4     |     2      |    1
    5     |     3      |    2
    6     |     4      |    2
    7     |     5      |    3
    8     |     5      |    4
    9     |     7      |    5

Desired result:

TouristID | GroupID
-------------------
    1     |    1
    2     |    1
    3     |    1
    4     |    1
    5     |    2
    6     |    2
    7     |    3
    8     |    3
    9     |    2

Group 1 is formed by all Language 1 speakers, including the one family member that can't be left out.

Group 2 is formed by the two Family members (5, 6) and one random member (9) to make the group of 3.

Group 3 is formed by the two same language speakers (7, 8)

What I've done:

INSERT TouristGroup
SELECT
  t.TouristID,
  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
FROM Tourists t
CROSS APPLY (
  SELECT MIN(TouristID) AS [GroupID]
  FROM Tourists t2
  WHERE
    ( t2.LanguageID = t.LanguageID
    OR t2.FamilyID = t.FamilyID )
) x;

INSERT Groups
SELECT GroupID, COUNT(*)
FROM TouristGroup
GROUP BY GroupID;

declare 
  @matchID int = 0,
  @currentCount int,
  @desiredCount int = 0,
  @candidateGroupID int = null,
  @chunk int = 1

while exists (
  select null
  from Groups g
  left join Matches m
    on m.GroupID = g.GroupID
  where m.GroupID is null
)
begin
  set @currentCount = null
  set @candidateGroupID = null

  select
    @currentCount = isnull(SUM([Count]), 0)
  from Matches m
  join Groups g
    on g.GroupID = m.GroupID
  where m.MatchID = @matchID

  if @CurrentCount is not null
  begin
    set @desiredCount = @chunk - @desiredCount

    select top 1
      @candidateGroupID = g.GroupID
    from Groups g
    left join Matches m
      on m.GroupID = g.GroupID
    where g.[Count] <= @desiredCount
      and m.GroupID is null
    order by [Count] DESC

    if @candidateGroupID is not null
    begin
      insert Matches
      select @matchID, @candidateGroupID
    end
    else begin
      set @matchID = @matchID + 1
    end
  end
  else begin
    set @matchid = @matchID + 1
  end
end

Question

Is there a better approach to partitioning a table, but grouping rows together based on multiple columns?

Solution

This will produce your "step 1". Maybe it's better than what you have now (no loop).

SELECT t.TouristID, DENSE_RANK() OVER (ORDER BY x.GroupNum) as GroupId
FROM Tourists t
CROSS APPLY (SELECT MIN(TouristId) AS GroupNum 
             FROM @Tourist t2 
             WHERE t2.LanguageId = t.LanguageId OR t2.FamilyId = t.FamilyId
            ) x

As for your other requirement of at least getting at least three group members, if possible, you'll probably have to do a loop similar to what you're doing (I'm not sure if it can be improved, since you haven't shared it).

[Update] Here's my proposal for "step 2":

DECLARE @MinGroupSize int = 3, @rc int = 1
WHILE @rc>0
BEGIN
    WITH GroupCount AS (
    SELECT GroupID, COUNT(*) AS GroupCount
    FROM TouristGroup
    GROUP BY GroupID
    ), CandidateGroups AS (
    SELECT TOP 1 gc1.GroupID AS ShortGroupId, singleton.GroupID as SingletonGroupID
    FROM GroupCount gc1
    CROSS APPLY (SELECT TOP 1 GroupID
                 FROM GroupCount AS gc2
                 WHERE gc2.GroupCount = 1 AND gc2.GroupID != gc1.GroupID
                 ORDER BY gc2.GroupID
                 ) AS singleton
    WHERE gc1.GroupCount < @MinGroupSize
    ORDER BY GroupCount DESC, gc1.GroupID ASC
    )
    UPDATE tg
    SET GroupID = cg.ShortGroupID
    FROM TouristGroup tg
    JOIN CandidateGroups cg ON cg.SingletonGroupID = tg.GroupID;
    SET @rc = @@ROWCOUNT;
END
--
-- If you're anal like me and want to eliminate gaps in GroupID values
--
UPDATE tg
SET GroupID = tg2.GroupID
FROM TouristGroup tg
JOIN (SELECT TouristID,  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
      FROM TouristGroup) AS tg2 ON tg2.TouristID = tg.TouristID
WHERE tg.GroupID != tg2.GroupID;

This will find groups smaller than the desired minimum group size and find a singleton group (only 1 member) and update the singleton with the other GroupID, doing this one by one until there are no more candidates. The smaller groups are chosen in order (by GroupCount descending and then by GroupID ascending) so that larger groups will be filled first. Only singletons are chosen for updating so that no natural groups will be broken up.