I am working on SQL server.
I have the following table:
For each BIGroup I have a multiple VarianceName. For each VarianceName I have multiple PartNumbers. I am comparing every partnumber with the other partnumbers within the same BIGroup and VarianceName, and writting the number of differences between PartNumber1 and PartNumber2 in the column Difference:
+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference | Cluster |
+---------+--------------+-------------+-------------+------------+-----------+
| D934 | A | 11426777 | 11426777 | 0 | |
| D934 | A | 11426777 | 11426781 | 0 | |
| D934 | A | 11426777 | 12542804 | 2 | |
| D934 | A | 11426777 | 12554759 | 4 | |
| D934 | A | 11426777 | 12564258 | 0 | |
| D934 | A | 11426781 | 11426777 | 0 | |
| D934 | A | 11426781 | 11426781 | 0 | |
| D934 | A | 11426781 | 12542804 | 5 | |
| D934 | A | 11426781 | 12554759 | 1 | |
| D934 | A | 11426781 | 12564258 | 0 | |
| D934 | A | 12542804 | 11426777 | 2 | |
| D934 | A | 12542804 | 11426781 | 5 | |
| D934 | A | 12542804 | 12542804 | 0 | |
| D934 | A | 12542804 | 12554759 | 0 | |
| D934 | A | 12542804 | 12564258 | 8 | |
| D934 | A | 12554759 | 11426777 | 4 | |
| D934 | A | 12554759 | 11426781 | 1 | |
| D934 | A | 12554759 | 12542804 | 0 | |
| D934 | A | 12554759 | 12554759 | 0 | |
| D934 | A | 12554759 | 12564258 | 9 | |
| D934 | A | 12564258 | 11426777 | 0 | |
| D934 | A | 12564258 | 11426781 | 0 | |
| D934 | A | 12564258 | 12542804 | 8 | |
| D934 | A | 12564258 | 12554759 | 9 | |
| D934 | A | 12564258 | 12564258 | 0 | |
| D934 | AA | 11438878 | 11438878 | 0 | |
| D934 | AB | 11438924 | 11438924 | 0 | |
| D934 | AC | 12556213 | 12556213 | 0 | |
| D934 | AC | 12556213 | 12556214 | 5 | |
| D934 | AC | 12556214 | 12556213 | 5 | |
| D934 | AC | 12556214 | 12556214 | 0 | |
| D955 | A | 75346846 | 75346846 | 0 | |
| ... | ... | ... | ... | 0 | |
+---------+--------------+-------------+-------------+------------+-----------+
ex: For D934, for VarianceName A, PartNumbers 11426777, 11426781 and 12564258 are identical because there is 0 differences between : 11426777 and 11426781, 11426781 and 12564258, and 12564258 and 11426777.
ex: For D934, for VarianceName A, PartNumbers 12542804 and 12554759 are identical because there is 0 differences between: 12542804 and 12554759.
My Goal is to identify all the group of identical PartNumbers within the same BIGroup and VarianceName. To flag those groups, I will use the column called Cluster.
So 11426777, 11426781 and 12564258 would belong to Cluster D934-A-C1.
So 12542804 and 12554759 would belong to cluster D934-A-C2.
What should be the query/stored procedure to update the Cluster column, to obtain the following result:
+---------+--------------+-------------+-------------+------------+-----------+
| BIGroup | VarianceName | PartNumber1 | PartNumber2 | Difference | Cluster |
+---------+--------------+-------------+-------------+------------+-----------+
| D934 | A | 11426777 | 11426777 | 0 | D934-A-C1 |
| D934 | A | 11426777 | 11426781 | 0 | D934-A-C1 |
| D934 | A | 11426777 | 12542804 | 2 | |
| D934 | A | 11426777 | 12554759 | 4 | |
| D934 | A | 11426777 | 12564258 | 0 | D934-A-C1 |
| D934 | A | 11426781 | 11426777 | 0 | D934-A-C1 |
| D934 | A | 11426781 | 11426781 | 0 | D934-A-C1 |
| D934 | A | 11426781 | 12542804 | 5 | |
| D934 | A | 11426781 | 12554759 | 1 | |
| D934 | A | 11426781 | 12564258 | 0 | D934-A-C1 |
| D934 | A | 12542804 | 11426777 | 2 | |
| D934 | A | 12542804 | 11426781 | 5 | |
| D934 | A | 12542804 | 12542804 | 0 | D934-A-C2 |
| D934 | A | 12542804 | 12554759 | 0 | D934-A-C2 |
| D934 | A | 12542804 | 12564258 | 8 | |
| D934 | A | 12554759 | 11426777 | 4 | |
| D934 | A | 12554759 | 11426781 | 1 | |
| D934 | A | 12554759 | 12542804 | 0 | D934-A-C2 |
| D934 | A | 12554759 | 12554759 | 0 | D934-A-C2 |
| D934 | A | 12554759 | 12564258 | 9 | |
| D934 | A | 12564258 | 11426777 | 0 | D934-A-C1 |
| D934 | A | 12564258 | 11426781 | 0 | D934-A-C1 |
| D934 | A | 12564258 | 12542804 | 8 | |
| D934 | A | 12564258 | 12554759 | 9 | |
| D934 | A | 12564258 | 12564258 | 0 | D934-A-C1 |
And so on for the other VarianceName
| D934 | AA | 11438878 | 11438878 | 0 | D934-AA-C1
| D934 | AB | 11438924 | 11438924 | 0 | D934-AB-C1
| D934 | AC | 12556213 | 12556213 | 0 | D934-AC-C1
| D934 | AC | 12556213 | 12556214 | 5 |
| D934 | AC | 12556214 | 12556213 | 5 |
| D934 | AC | 12556214 | 12556214 | 0 | D934-AC-C1
And so on for the other BiGroup
| D955 | A | 75346846 | 75346846 | 0 | D955-A-C1
| ... | ... | ... | ... | ... |
+---------+--------------+-------------+-------------+------------+-----------+
The column should be left to NULL if Difference > 0
Here is the script to have the data as a cte:
with t1 as
(
select 'D934' as BIGroup ,'A' as VarianceName , 11426777 as PartNumber1, 11426777 as PartNumber2, 0 as Difference, null as Cluster
union select 'D934' ,'A' , 11426777 , 11426781 , 0 , null
union select 'D934' ,'A' , 11426777 , 12542804 , 2 , null
union select 'D934' ,'A' , 11426777 , 12554759 , 4 , null
union select 'D934' ,'A' , 11426777 , 12564258 , 0 , null
union select 'D934' ,'A' , 11426781 , 11426777 , 0 , null
union select 'D934' ,'A' , 11426781 , 11426781 , 0 , null
union select 'D934' ,'A' , 11426781 , 12542804 , 5 , null
union select 'D934' ,'A' , 11426781 , 12554759 , 1 , null
union select 'D934' ,'A' , 11426781 , 12564258 , 0 , null
union select 'D934' ,'A' , 12542804 , 11426777 , 2 , null
union select 'D934' ,'A' , 12542804 , 11426781 , 5 , null
union select 'D934' ,'A' , 12542804 , 12542804 , 0 , null
union select 'D934' ,'A' , 12542804 , 12554759 , 0 , null
union select 'D934' ,'A' , 12542804 , 12564258 , 8 , null
union select 'D934' ,'A' , 12554759 , 11426777 , 4 , null
union select 'D934' ,'A' , 12554759 , 11426781 , 1 , null
union select 'D934' ,'A' , 12554759 , 12542804 , 0 , null
union select 'D934' ,'A' , 12554759 , 12554759 , 0 , null
union select 'D934' ,'A' , 12554759 , 12564258 , 9 , null
union select 'D934' ,'A' , 12564258 , 11426777 , 0 , null
union select 'D934' ,'A' , 12564258 , 11426781 , 0 , null
union select 'D934' ,'A' , 12564258 , 12542804 , 8 , null
union select 'D934' ,'A' , 12564258 , 12554759 , 9 , null
union select 'D934' ,'A' , 12564258 , 12564258 , 0 , null
union select 'D934' ,'AA' , 11438878 , 11438878 , 0 , null
union select 'D934' ,'AB' , 11438924 , 11438924 , 0 , null
union select 'D934' ,'AC' , 12556213 , 12556213 , 0 , null
union select 'D934' ,'AC' , 12556213 , 12556214 , 5 , null
union select 'D934' ,'AC' , 12556214 , 12556213 , 5 , null
union select 'D934' ,'AC' , 12556214 , 12556214 , 0 , null
union select 'D955' ,'A' , 75346846 , 75346846 , 0 , null
)
Edit:
To better understand the problem, I drew the 5 partnumbers of D934
A
, their links, and the two clusters.
The links we are interested in are the black ones (because it means there is 0 differences between the partNumbers).
The orange links are representing difference>0 between the partNumbers.
After drawing the links we can identify 2 clusters, which I drew with red circles.
I managed to solve this problem with a stored procedure:
DECLARE @BiGroup [nvarchar](30);
DECLARE @VarianceName [nvarchar](30);
DECLARE @NewBiGroup [nvarchar](30);
DECLARE @NewVarianceName [nvarchar](30);
DECLARE @PartNumber [nvarchar](30);
DECLARE @ClusterName [nvarchar](30);
DECLARE @IncrementClusterName [nvarchar](30);
set @BiGroup = 'first_BiGroup';
set @VarianceName = 'first_VarianceName';
set @IncrementClusterName = 1;
set @ClusterName = null;
-- Declare cursor
DECLARE cur CURSOR READ_ONLY FOR
Select [PartNumber1] FROM t1
order by [BIGroup] ,[VarianceName] ,[PartNumber1];
--clean cluster column
update t1 set [Cluster]=null;
OPEN cur
FETCH NEXT FROM cur INTO @PartNumber
-- Loop on every PartNumber
WHILE @@FETCH_STATUS = 0
BEGIN
--set NewBiGroup and NewPartNumber
set @NewBiGroup = (select Top(1) [BIGroup] from t1 where partnumber1 = @PartNumber);
set @NewVarianceName = (select Top(1) [VarianceName] from t1 where partnumber1 = @PartNumber);
--check if we are still in the same BIGroup and Variance, otherwise, reset the cluster increment
if @NewBiGroup <> @BiGroup or @NewVarianceName <> @VarianceName
BEGIN
set @IncrementClusterName = 1;
END
--get the clusterName of this partNumber, if it exists
set @ClusterName = (select Top(1) [Cluster] from t1 where partnumber2 = @PartNumber and [Cluster] is not null);
--if ClusterName is NULL, put a clustername and then increment the @IncrementClusterName,
--otherwise set the cluster to @ClusterName
if @ClusterName is null
BEGIN
update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@IncrementClusterName
where partnumber1 = @PartNumber
and Difference= 0 ;
set @IncrementClusterName = @IncrementClusterName +1;
END
else
BEGIN
update t1 set [Cluster] = @NewBiGroup+'-'+@NewVarianceName+'-'+@ClusterName
where partnumber1 = @PartNumber
and Difference= 0 ;
END
-- setting the BiGroup and VarianceName
set @BiGroup = @NewBiGroup;
set @VarianceName = @NewVarianceName;
FETCH NEXT FROM cur INTO @PartNumber
END
CLOSE cur
DEALLOCATE cur
The algorithm of the stored procedure works like this:
PartNumber
with Difference=0
BiGroup
or VarianceName
has changed
@clusterIncrement
to 1Cluster
Cluster
to @clusterIncrement
@clusterIncrement = @clusterIncrement +1
Cluster
Cluster
to the existing Cluster