Search code examples
sqlsql-serveralgorithmcluster-analysisdata-mining

Grouping the rows by similarities


I am working on SQL server.

I have the following table:

For each BIGroup I have a multiple VarianceName. For each VarianceName I have multiple PartNumbers. I am comparing every partnumber with the other partnumbers within the same BIGroup and VarianceName, and writting the number of differences between PartNumber1 and PartNumber2 in the column Difference:

+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference |  Cluster  |
+---------+--------------+-------------+-------------+------------+-----------+
| D934    | A            |    11426777 |    11426777 |          0 |           |
| D934    | A            |    11426777 |    11426781 |          0 |           |
| D934    | A            |    11426777 |    12542804 |          2 |           |
| D934    | A            |    11426777 |    12554759 |          4 |           |
| D934    | A            |    11426777 |    12564258 |          0 |           |
| D934    | A            |    11426781 |    11426777 |          0 |           |
| D934    | A            |    11426781 |    11426781 |          0 |           |
| D934    | A            |    11426781 |    12542804 |          5 |           |
| D934    | A            |    11426781 |    12554759 |          1 |           |
| D934    | A            |    11426781 |    12564258 |          0 |           |
| D934    | A            |    12542804 |    11426777 |          2 |           |
| D934    | A            |    12542804 |    11426781 |          5 |           |
| D934    | A            |    12542804 |    12542804 |          0 |           |
| D934    | A            |    12542804 |    12554759 |          0 |           |
| D934    | A            |    12542804 |    12564258 |          8 |           |
| D934    | A            |    12554759 |    11426777 |          4 |           |
| D934    | A            |    12554759 |    11426781 |          1 |           |
| D934    | A            |    12554759 |    12542804 |          0 |           |
| D934    | A            |    12554759 |    12554759 |          0 |           |
| D934    | A            |    12554759 |    12564258 |          9 |           |
| D934    | A            |    12564258 |    11426777 |          0 |           |
| D934    | A            |    12564258 |    11426781 |          0 |           |
| D934    | A            |    12564258 |    12542804 |          8 |           |
| D934    | A            |    12564258 |    12554759 |          9 |           |
| D934    | A            |    12564258 |    12564258 |          0 |           |
| D934    | AA           |    11438878 |    11438878 |          0 |           |
| D934    | AB           |    11438924 |    11438924 |          0 |           |
| D934    | AC           |    12556213 |    12556213 |          0 |           |
| D934    | AC           |    12556213 |    12556214 |          5 |           |
| D934    | AC           |    12556214 |    12556213 |          5 |           |
| D934    | AC           |    12556214 |    12556214 |          0 |           |
| D955    | A            |    75346846 |    75346846 |          0 |           |
| ...     | ...          |    ...      |    ...      |          0 |           |
+---------+--------------+-------------+-------------+------------+-----------+

ex: For D934, for VarianceName A, PartNumbers 11426777, 11426781 and 12564258 are identical because there is 0 differences between : 11426777 and 11426781, 11426781 and 12564258, and 12564258 and 11426777.

ex: For D934, for VarianceName A, PartNumbers 12542804 and 12554759 are identical because there is 0 differences between: 12542804 and 12554759.

My Goal is to identify all the group of identical PartNumbers within the same BIGroup and VarianceName. To flag those groups, I will use the column called Cluster.

So 11426777, 11426781 and 12564258 would belong to Cluster D934-A-C1.

So 12542804 and 12554759 would belong to cluster D934-A-C2.

What should be the query/stored procedure to update the Cluster column, to obtain the following result:

+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference |  Cluster  |
+---------+--------------+-------------+-------------+------------+-----------+
| D934    | A            |    11426777 |    11426777 |          0 | D934-A-C1 |
| D934    | A            |    11426777 |    11426781 |          0 | D934-A-C1 |
| D934    | A            |    11426777 |    12542804 |          2 |           |
| D934    | A            |    11426777 |    12554759 |          4 |           |
| D934    | A            |    11426777 |    12564258 |          0 | D934-A-C1 |
| D934    | A            |    11426781 |    11426777 |          0 | D934-A-C1 |
| D934    | A            |    11426781 |    11426781 |          0 | D934-A-C1 |
| D934    | A            |    11426781 |    12542804 |          5 |           |
| D934    | A            |    11426781 |    12554759 |          1 |           |
| D934    | A            |    11426781 |    12564258 |          0 | D934-A-C1 |
| D934    | A            |    12542804 |    11426777 |          2 |           |
| D934    | A            |    12542804 |    11426781 |          5 |           |
| D934    | A            |    12542804 |    12542804 |          0 | D934-A-C2 |
| D934    | A            |    12542804 |    12554759 |          0 | D934-A-C2 |
| D934    | A            |    12542804 |    12564258 |          8 |           |
| D934    | A            |    12554759 |    11426777 |          4 |           |
| D934    | A            |    12554759 |    11426781 |          1 |           |
| D934    | A            |    12554759 |    12542804 |          0 | D934-A-C2 |
| D934    | A            |    12554759 |    12554759 |          0 | D934-A-C2 |
| D934    | A            |    12554759 |    12564258 |          9 |           |
| D934    | A            |    12564258 |    11426777 |          0 | D934-A-C1 |
| D934    | A            |    12564258 |    11426781 |          0 | D934-A-C1 |
| D934    | A            |    12564258 |    12542804 |          8 |           |
| D934    | A            |    12564258 |    12554759 |          9 |           |
| D934    | A            |    12564258 |    12564258 |          0 | D934-A-C1 |

And so on for the other VarianceName

| D934    | AA           |    11438878 |    11438878 |          0 | D934-AA-C1  
| D934    | AB           |    11438924 |    11438924 |          0 | D934-AB-C1
| D934    | AC           |    12556213 |    12556213 |          0 | D934-AC-C1
| D934    | AC           |    12556213 |    12556214 |          5 |
| D934    | AC           |    12556214 |    12556213 |          5 |
| D934    | AC           |    12556214 |    12556214 |          0 | D934-AC-C1

And so on for the other BiGroup

| D955    | A            |    75346846 |    75346846 |          0 | D955-A-C1
| ...     | ...          |    ...      |    ...      |        ... |
+---------+--------------+-------------+-------------+------------+-----------+

The column should be left to NULL if Difference > 0

Here is the script to have the data as a cte:

with t1 as
( 
select 'D934'  as BIGroup  ,'A'    as VarianceName  ,   11426777 as PartNumber1,   11426777 as PartNumber2,         0 as Difference,  null as Cluster        
union select 'D934'    ,'A'            ,   11426777 ,   11426781 ,         0 , null
union select 'D934'    ,'A'            ,   11426777 ,   12542804 ,         2 , null
union select 'D934'    ,'A'            ,   11426777 ,   12554759 ,         4 , null
union select 'D934'    ,'A'            ,   11426777 ,   12564258 ,         0 , null
union select 'D934'    ,'A'            ,   11426781 ,   11426777 ,         0 , null
union select 'D934'    ,'A'            ,   11426781 ,   11426781 ,         0 , null
union select 'D934'    ,'A'            ,   11426781 ,   12542804 ,         5 , null
union select 'D934'    ,'A'            ,   11426781 ,   12554759 ,         1 , null
union select 'D934'    ,'A'            ,   11426781 ,   12564258 ,         0 , null
union select 'D934'    ,'A'            ,   12542804 ,   11426777 ,         2 , null
union select 'D934'    ,'A'            ,   12542804 ,   11426781 ,         5 , null
union select 'D934'    ,'A'            ,   12542804 ,   12542804 ,         0 , null
union select 'D934'    ,'A'            ,   12542804 ,   12554759 ,         0 , null
union select 'D934'    ,'A'            ,   12542804 ,   12564258 ,         8 , null
union select 'D934'    ,'A'            ,   12554759 ,   11426777 ,         4 , null
union select 'D934'    ,'A'            ,   12554759 ,   11426781 ,         1 , null
union select 'D934'    ,'A'            ,   12554759 ,   12542804 ,         0 , null
union select 'D934'    ,'A'            ,   12554759 ,   12554759 ,         0 , null
union select 'D934'    ,'A'            ,   12554759 ,   12564258 ,         9 , null
union select 'D934'    ,'A'            ,   12564258 ,   11426777 ,         0 , null
union select 'D934'    ,'A'            ,   12564258 ,   11426781 ,         0 , null
union select 'D934'    ,'A'            ,   12564258 ,   12542804 ,         8 , null
union select 'D934'    ,'A'            ,   12564258 ,   12554759 ,         9 , null
union select 'D934'    ,'A'            ,   12564258 ,   12564258 ,         0 , null
union select 'D934'    ,'AA'           ,   11438878 ,   11438878 ,         0 , null
union select 'D934'    ,'AB'           ,   11438924 ,   11438924 ,         0 , null
union select 'D934'    ,'AC'           ,   12556213 ,   12556213 ,         0 , null
union select 'D934'    ,'AC'           ,   12556213 ,   12556214 ,         5 , null
union select 'D934'    ,'AC'           ,   12556214 ,   12556213 ,         5 , null
union select 'D934'    ,'AC'           ,   12556214 ,   12556214 ,         0 , null
union select 'D955'    ,'A'            ,   75346846 ,   75346846 ,         0 , null
)

Edit: To better understand the problem, I drew the 5 partnumbers of D934 A, their links, and the two clusters.

The links we are interested in are the black ones (because it means there is 0 differences between the partNumbers).

The orange links are representing difference>0 between the partNumbers.

After drawing the links we can identify 2 clusters, which I drew with red circles.

enter image description here


Solution

  • I managed to solve this problem with a stored procedure:

    DECLARE @BiGroup                    [nvarchar](30);
    DECLARE @VarianceName               [nvarchar](30);
    DECLARE @NewBiGroup                 [nvarchar](30);
    DECLARE @NewVarianceName            [nvarchar](30);
    DECLARE @PartNumber                 [nvarchar](30);
    DECLARE @ClusterName                [nvarchar](30);
    DECLARE @IncrementClusterName       [nvarchar](30);
    
    set @BiGroup = 'first_BiGroup';
    set @VarianceName = 'first_VarianceName';
    set @IncrementClusterName = 1;
    set @ClusterName = null;
    
    -- Declare cursor
    DECLARE cur CURSOR READ_ONLY FOR
    Select [PartNumber1] FROM t1
    order by [BIGroup] ,[VarianceName] ,[PartNumber1];
    
    --clean cluster column
    update t1 set [Cluster]=null;
    
    OPEN cur
    FETCH NEXT FROM cur INTO @PartNumber
    
    -- Loop on every PartNumber
    WHILE @@FETCH_STATUS = 0
    BEGIN
    
      --set NewBiGroup and NewPartNumber
      set @NewBiGroup = (select Top(1) [BIGroup] from t1 where partnumber1 = @PartNumber);
      set @NewVarianceName = (select Top(1) [VarianceName] from t1 where partnumber1 = @PartNumber);
    
      --check if we are still in the same BIGroup and Variance, otherwise, reset the cluster increment
      if @NewBiGroup <> @BiGroup or @NewVarianceName <> @VarianceName 
      BEGIN
        set @IncrementClusterName = 1;
      END
    
      --get the clusterName of this partNumber, if it exists
      set @ClusterName = (select Top(1) [Cluster] from t1 where partnumber2 = @PartNumber and [Cluster] is not null);
    
      --if ClusterName is NULL, put a clustername and then increment the @IncrementClusterName, 
      --otherwise set the cluster to @ClusterName
      if @ClusterName is null
      BEGIN
        update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@IncrementClusterName
        where partnumber1 = @PartNumber  
        and Difference= 0 ;
    
        set @IncrementClusterName = @IncrementClusterName +1;
      END
      else
      BEGIN
        update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@ClusterName
        where partnumber1 = @PartNumber  
        and Difference= 0 ;
      END
    
      -- setting the BiGroup and VarianceName
      set @BiGroup = @NewBiGroup;
      set @VarianceName = @NewVarianceName;
    
     FETCH NEXT FROM cur INTO @PartNumber
    END
    
    CLOSE cur
    DEALLOCATE cur
    

    The algorithm of the stored procedure works like this:

    • For every PartNumber with Difference=0
      • If the BiGroup or VarianceName has changed
        • I reset the @clusterIncrement to 1
      • If he isn't already part of a Cluster
        • I set his Cluster to @clusterIncrement
        • @clusterIncrement = @clusterIncrement +1
      • If he is already part of a Cluster
        • I set his Cluster to the existing Cluster