Search code examples
sqlsql-serversql-server-2008t-sqlsqlperformance

Partition table but group together based on multiple columns


I have an interesting problem with partitioning a table into groups. I have a group of Tourists - each speak a single language and/or are part of a family. I need to partition the table into groups, but I want to keep families and similar language speakers together.

Let's say I want to split the Tourists into groups of up to 3 (if a group must be larger, that's acceptable). The solution doesn't have to be so smart that it fills up all groups entirely, but I'm doing a best-effort approach.

Input:

TouristID | LanguageID | FamilyID
---------------------------------
    1     |     1      |    1
    2     |     1      |    1
    3     |     1      |    1
    4     |     2      |    1
    5     |     3      |    2
    6     |     4      |    2
    7     |     5      |    3
    8     |     5      |    4
    9     |     7      |    5

Desired result:

TouristID | GroupID
-------------------
    1     |    1
    2     |    1
    3     |    1
    4     |    1
    5     |    2
    6     |    2
    7     |    3
    8     |    3
    9     |    2

Group 1 is formed by all Language 1 speakers, including the one family member that can't be left out.

Group 2 is formed by the two Family members (5, 6) and one random member (9) to make the group of 3.

Group 3 is formed by the two same language speakers (7, 8)

What I've done:

INSERT TouristGroup
SELECT
  t.TouristID,
  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
FROM Tourists t
CROSS APPLY (
  SELECT MIN(TouristID) AS [GroupID]
  FROM Tourists t2
  WHERE
    ( t2.LanguageID = t.LanguageID
    OR t2.FamilyID = t.FamilyID )
) x;

INSERT Groups
SELECT GroupID, COUNT(*)
FROM TouristGroup
GROUP BY GroupID;

declare 
  @matchID int = 0,
  @currentCount int,
  @desiredCount int = 0,
  @candidateGroupID int = null,
  @chunk int = 1

while exists (
  select null
  from Groups g
  left join Matches m
    on m.GroupID = g.GroupID
  where m.GroupID is null
)
begin
  set @currentCount = null
  set @candidateGroupID = null

  select
    @currentCount = isnull(SUM([Count]), 0)
  from Matches m
  join Groups g
    on g.GroupID = m.GroupID
  where m.MatchID = @matchID

  if @CurrentCount is not null
  begin
    set @desiredCount = @chunk - @desiredCount

    select top 1
      @candidateGroupID = g.GroupID
    from Groups g
    left join Matches m
      on m.GroupID = g.GroupID
    where g.[Count] <= @desiredCount
      and m.GroupID is null
    order by [Count] DESC

    if @candidateGroupID is not null
    begin
      insert Matches
      select @matchID, @candidateGroupID
    end
    else begin
      set @matchID = @matchID + 1
    end
  end
  else begin
    set @matchid = @matchID + 1
  end
end         

Question

Is there a better approach to partitioning a table, but grouping rows together based on multiple columns?


Solution

  • This will produce your "step 1". Maybe it's better than what you have now (no loop).

    SELECT t.TouristID, DENSE_RANK() OVER (ORDER BY x.GroupNum) as GroupId
    FROM Tourists t
    CROSS APPLY (SELECT MIN(TouristId) AS GroupNum 
                 FROM @Tourist t2 
                 WHERE t2.LanguageId = t.LanguageId OR t2.FamilyId = t.FamilyId
                ) x
    

    As for your other requirement of at least getting at least three group members, if possible, you'll probably have to do a loop similar to what you're doing (I'm not sure if it can be improved, since you haven't shared it).

    [Update] Here's my proposal for "step 2":

    DECLARE @MinGroupSize int = 3, @rc int = 1
    WHILE @rc>0
    BEGIN
        WITH GroupCount AS (
        SELECT GroupID, COUNT(*) AS GroupCount
        FROM TouristGroup
        GROUP BY GroupID
        ), CandidateGroups AS (
        SELECT TOP 1 gc1.GroupID AS ShortGroupId, singleton.GroupID as SingletonGroupID
        FROM GroupCount gc1
        CROSS APPLY (SELECT TOP 1 GroupID
                     FROM GroupCount AS gc2
                     WHERE gc2.GroupCount = 1 AND gc2.GroupID != gc1.GroupID
                     ORDER BY gc2.GroupID
                     ) AS singleton
        WHERE gc1.GroupCount < @MinGroupSize
        ORDER BY GroupCount DESC, gc1.GroupID ASC
        )
        UPDATE tg
        SET GroupID = cg.ShortGroupID
        FROM TouristGroup tg
        JOIN CandidateGroups cg ON cg.SingletonGroupID = tg.GroupID;
        SET @rc = @@ROWCOUNT;
    END
    --
    -- If you're anal like me and want to eliminate gaps in GroupID values
    --
    UPDATE tg
    SET GroupID = tg2.GroupID
    FROM TouristGroup tg
    JOIN (SELECT TouristID,  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
          FROM TouristGroup) AS tg2 ON tg2.TouristID = tg.TouristID
    WHERE tg.GroupID != tg2.GroupID;
    

    This will find groups smaller than the desired minimum group size and find a singleton group (only 1 member) and update the singleton with the other GroupID, doing this one by one until there are no more candidates. The smaller groups are chosen in order (by GroupCount descending and then by GroupID ascending) so that larger groups will be filled first. Only singletons are chosen for updating so that no natural groups will be broken up.