Search code examples
matlabcluster-analysismultistage

Matlab: How to customize a clustering code to be a multistage clustering?


I want to cluster a huge amount of data records. The data that I'm dealing with are of string type. The clustering process takes a long time.
Let us assume that I want to cluster a set of email data records into cluster, where emails written by the same person are allocated to the same cluster (taking into account that a person might write his/her name in different ways).
I want to perform a multi stage clustering:

  • First stage clustering based on name, if the name distance between two records is less than a threshold we consider these clusters otherwise...
  • The data records enters the second stage of clustering based on other attributes (other than name).

The pairwise distance is calculated. Now I'm in the clustering phase. I want to use the following code for dbscan clustering:

function [IDX, isnoise] = dbscan_strings(X,epsilon,MinPts)
C = 0;
n = size(X,1); 
IDX = zeros(n,1);
D = pdist2(X,X,@intersection);
visited = false(n,1);
isnoise = false(n,1);
for i = 1:n
    if ~visited(i)
        visited(i) = true;
        Neighbors = RegionQuery(i);
        if numel(Neighbors)<MinPts
            % X(i,:) is NOISE
            isnoise(i) = true;
        else
            C = C+1;
            ExpandCluster(i,Neighbors,C);
        end
    end
end

function ExpandCluster(i,Neighbors,C)
    IDX(i) = C;
    k = 1;
    while true
        j = Neighbors(k);
        if ~visited(j)
            visited(j) = true;
            Neighbors2 = RegionQuery(j);
            if numel(Neighbors2)>=MinPts
                Neighbors = [Neighbors Neighbors2];   %#ok
            end
        end
        if IDX(j)==0
            IDX(j) = C;
        end
        k = k + 1;
        if k > numel(Neighbors)
            break;
        end
    end
end

function Neighbors = RegionQuery(i)
    Neighbors = find(D(i,:)<=epsilon);
end
end

I need help in making the following clustering process into a multistage process where X contains data records with all attributes. Let us assume that X{:,1} is the data records with the name attribute, since the name is contained in the first column.

NOTE: I will give a bounty of 50 points for the one who helps me.


Solution

  • Don't do everything at once!

    You are computing a lot of things that you never need, which makes things slow. For example, a good DBSCAN does not use a distance function, but an index.

    For the names, only work on unique names! You supposedly have many exact same names, but you end up computing the same distances again and again.

    So first of all, build a set of unique names only. Perform your similarity matching on this (I would however suggest to use OpenRefine for this rather than Matlab!). Once you have identified names to merge, build a new data matrix for every name group. Then run whatever clustering you want. Good candidates are probably HDBSCAN, and OPTICSXi (have a look at the clustering algorithms available in ELKI, which probably has the widest selection to choose from). Maybe start only with an average common name, to get a feeling of the parameters for the algorithm. Don't cluster all subsets at once.