I am trying to understand this algorithm, but not able to get proper documents and explanations. Can someone please help me understand this clustering algorithm.
Posting the answer so that it is helpful to others.
Leader algorithm is a incremental clustering algorithm generally used to cluster large data sets. This algorithm is order dependent and may form different clusters based on the order the data set is provided to the algorithm. The algorithm consists of the following steps.
Step 1: Assign the first data item, P1 to cluster C1. This data set will be the leader of the cluster C1.
Step 2:Now move to the next data item say P2 and calculate its distance from the leader P1. If the distance between P2 and leader P1 is less than a user specified threshold (t) then data point P2 is assigned to this cluster (Cluster C1). If the distance between leader P1 and data item P2 is more than the user specified threshold t, then form a new cluster C2 and assign P2 to this new cluster. P2 will be the leader of the cluster C2.
Step3: For all the remaining data items the distance between the data point and the leader of the clusters is calculated. If the distance between the data items and the any of the leader is less then the user specified threshold, the data point is assigned to that cluster. However, If the distance between the data point and the any of the cluster's leader is more than the user specified threshold, a new cluster is created and that particular data point is assigned to that cluster and considered the leader of the cluster.
Step 4: Repeat Step 3 till all the data items are assigned to clusters.
Example to get the theory clear.
Consider the patterns are located at
A (1, 1),B(1, 2), C(2, 2), D(6, 2), E(7, 2), F(6, 6), G(7, 6)
Let the data be processed in the order A, B, C, D, E, F and G
, and the user specified threshold T
be 3
.
A(1, 1)
is the first data item processed and it is assigned to cluster C1
and also made the leader of C1
.
For the second point B
, its distance from the leader A
is calculated.
Using euclidean distance formula ( Distance(a, b)) = √(x - a)² + (y - b)²
), we get the distance as √(1 - 1)² + (1 - 2)² = 1
, this is less than the user specified threshold 3
, so B
is assigned to the Cluster 1.
For the third point C(2, 2)
the distance between leader A(1, 1)
of cluster C1
and point C
is calculated. Using the Euclidean formula the distance is √(1 - 2)² + (1 - 2)² = 1.41
, which is less than
threshold, so C
is also assigned to C1
.
Distance between A and D (√(1 - 6)² + (1 - 2)² = 5.099 ) is more than the user specified threshold 3, so a new cluster is created and D is assigned to cluster C2
. D is the leader of this cluster.
For the point E
, its distance from A
(leader of C1
) and D
(leader of C2
) is calculated. Since Distance(D,E)
is less then the user specified threshold 3
, it is assigned to cluster 2.
The distance of F from A
(the leader of C1) is 7.07
and from D
(the leader of C2) is 4
.
Both these distances are above the threshold and therefore F
is put into the new cluster C3
and is made the leader of this cluster.
For G
the Distance(A,G)
, Distance(D,G)
and Distance(F,G)
are 7.81
, 6.41
and 1
respectively. Since Distance(F,G)
is less then user specified 3
it is assigned to cluster 3.
It can be seen that if the data had been processed in a different order, the cluster
leaders would have been different—even the clusters can vary. If C
occurred before
A
and B
, then C
would have been the leader of C1
. If D
occurred before and the
distance between C
and D
was less than the threshold, it would have been in C1
. This
may not occur if A
is the leader. The leader algorithm is therefore order dependent and may give different results based on the order of processing.