Search code examples
algorithmsascluster-analysis

Clusterization algorithm


I have problem with clusterization of clients.

I have a dataset with columns such as name, address, email, phone, etc. (in a example A,B,C). Each row has unique identifier (ID). I need to assign CLUSTER_ID (X) to each row. In one cluster all rows have one or more the same attributes as other rows. So clients with ID=1,2,3 have the same A attribute and clients with ID=3,10 have the same B attribute then ID=1,2,3,10 should be in the same cluster.

How can I solve this problem using SQL? If it's not possible how to write the algorithm (pseudocode)? The performance is very important, because the dataset contains milions of rows.

Sample Input:

ID  A   B   C
1   A1  B3  C1
2   A1  B2  C5
3   A1  B10 C10
4   A2  B1  C5
5   A2  B8  C1
6   A3  B1  C4
7   A4  B6  C3
8   A4  B3  C5
9   A5  B7  C2
10  A6  B10 C3
11  A8  B5  C4

Sample Output:

ID  A   B   C   X
1   A1  B3  C1  1
2   A1  B2  C5  1
3   A1  B10 C10 1
4   A2  B1  C5  1
5   A2  B8  C1  1
6   A3  B1  C4  1
7   A4  B6  C3  1
8   A4  B3  C5  1
9   A5  B7  C2  2
10  A6  B10 C3  1
11  A8  B5  C4  1

Thanks for any help.


Solution

  • A possible way is by repeating updates for the empty X.

    Start with cluster_id 1. F.e. by using a variable.

    SET @CurrentClusterID = 1
    

    Take the top 1 record, and update it's X to 1.

    Now loop an update for all records with an empty X, and that can be linked to a record with X = 1 and that has the same A or B or C

    Disclaimer:
    The statement will vary depending on the RDBMS.
    This is just intended as pseudo-code.

    WHILE (<<some check to see if there were records updated>>) 
    BEGIN
      UPDATE yourtable t
      SET t.X = @CurrentClusterID
      WHERE t.X IS NULL
        AND EXISTS (
          SELECT 1 FROM yourtable d 
          WHERE d.X =  @CurrentClusterID
            AND (d.A = t.A OR d.B = t.B OR d.C = t.C)
      );
    END
    

    Loop that till it updates 0 records.

    Now repeat the method for the other clusters, till there are no more empty X in the table.

    1) Increase the @CurrentClusterID by 1
    2) Update the next top 1 record with an empty X to the new @CurrentClusterID
    3) Loop the update till no-more updates were done.

    An example test on db<>fiddle here for MS Sql Server.