I want to cluster a huge amount of data records. The data that I'm dealing with are of string type. The clustering process takes a long time.
Let us assume that I want to cluster a set of email data records into cluster, where emails written by the same person are allocated to the same cluster (taking into account that a person might write his/her name in different ways).
I want to perform a multi stage clustering:
The pairwise distance is calculated. Now I'm in the clustering phase. I want to use the following code for dbscan
clustering:
function [IDX, isnoise] = dbscan_strings(X,epsilon,MinPts)
C = 0;
n = size(X,1);
IDX = zeros(n,1);
D = pdist2(X,X,@intersection);
visited = false(n,1);
isnoise = false(n,1);
for i = 1:n
if ~visited(i)
visited(i) = true;
Neighbors = RegionQuery(i);
if numel(Neighbors)<MinPts
% X(i,:) is NOISE
isnoise(i) = true;
else
C = C+1;
ExpandCluster(i,Neighbors,C);
end
end
end
function ExpandCluster(i,Neighbors,C)
IDX(i) = C;
k = 1;
while true
j = Neighbors(k);
if ~visited(j)
visited(j) = true;
Neighbors2 = RegionQuery(j);
if numel(Neighbors2)>=MinPts
Neighbors = [Neighbors Neighbors2]; %#ok
end
end
if IDX(j)==0
IDX(j) = C;
end
k = k + 1;
if k > numel(Neighbors)
break;
end
end
end
function Neighbors = RegionQuery(i)
Neighbors = find(D(i,:)<=epsilon);
end
end
I need help in making the following clustering process into a multistage process where X
contains data records with all attributes. Let us assume that X{:,1}
is the data records with the name attribute, since the name is contained in the first column.
NOTE: I will give a bounty of 50 points for the one who helps me.
Don't do everything at once!
You are computing a lot of things that you never need, which makes things slow. For example, a good DBSCAN does not use a distance function, but an index.
For the names, only work on unique names! You supposedly have many exact same names, but you end up computing the same distances again and again.
So first of all, build a set of unique names only. Perform your similarity matching on this (I would however suggest to use OpenRefine for this rather than Matlab!). Once you have identified names to merge, build a new data matrix for every name group. Then run whatever clustering you want. Good candidates are probably HDBSCAN, and OPTICSXi (have a look at the clustering algorithms available in ELKI, which probably has the widest selection to choose from). Maybe start only with an average common name, to get a feeling of the parameters for the algorithm. Don't cluster all subsets at once.