Clustering Multiple Attributes with ELKI


I use the ELKI framework to cluster a series of points, defined by their geographic coordinates (longitude, latitude). The algorithm I use is DBSCAN.

Now I would like to add another (numerical) attribute that weights the importance of the points (let's say size).

In theory, the points would now be defined in a 3 dimensional space (rather than 2D) and the distance would be a mixture of geographic distance and data distance.

In practice, I tried to do this in ELKI, but I ran into a concrete problem: the clustering algorithms expect a "database" as input.

    Clustering<DBSCANModel> de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(Database database)

This database is created from a ListParameterization, which, among other things, reads a database connection:

    params.addParameter(
        AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbc);

Finally, this database connection reads the data from a 2D array:

Import an existing data matrix (double[rows][cols]) into an ELKI database.

    // array is a double[rows][cols] data matrix
    DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(array);

My question is: is there any way to replace this 2D array with an N-dimensional array?

For instance, in my case I would like to use a 3D array to store the two geographic coordinates and the numerical attribute. Something like this:

array[][][]


Solution

  • If you want to weight the instances, you should switch to GeneralizedDBSCAN and implement a weighted CorePredicate.

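    A double[rows][cols] array is fine. Each row is one point and each column is one attribute, so a third attribute is a third column, not a third array dimension. You have three columns: longitude, latitude, weight.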

    DimensionSelectingLatLngDistanceFunction can work with 3D vectors, too. You just have to specify which column holds the latitude and which column holds the longitude.

    Alternatively, you can build your own DatabaseConnection. It could return two relations: one is a 2D vector field containing latitude and longitude, the other is a 1D relation containing only the weights. But working with multiple relations can be tricky. The approach above is easier to use.
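
To make the idea of a weighted core condition concrete, here is a minimal sketch in plain Java. It does not use ELKI and does not reproduce the GeneralizedDBSCAN CorePredicate interface. The class name WeightedCoreSketch, its methods, the column layout (longitude, latitude, weight) and the sample values are all assumptions made for illustration. The distance reads only the two coordinate columns, and a point counts as a core point when the summed weight of its neighbors, rather than their count, reaches a threshold.

```java
final class WeightedCoreSketch {
    // Assumed column layout for this sketch: longitude, latitude, weight.
    static final int LNG = 0, LAT = 1, WEIGHT = 2;
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    /** Great-circle (haversine) distance in km; the weight column is ignored. */
    static double geoDistanceKm(double[] a, double[] b) {
        double lat1 = Math.toRadians(a[LAT]);
        double lat2 = Math.toRadians(b[LAT]);
        double dLat = lat2 - lat1;
        double dLng = Math.toRadians(b[LNG] - a[LNG]);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(lat1) * Math.cos(lat2)
                 * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(h)));
    }

    /** Sum of the weights of all rows within epsKm of row i (row i included). */
    static double neighborhoodWeight(double[][] data, int i, double epsKm) {
        double sum = 0.0;
        for (double[] other : data) {
            if (geoDistanceKm(data[i], other) <= epsKm) {
                sum += other[WEIGHT];
            }
        }
        return sum;
    }

    /** Weighted core condition: total neighborhood weight replaces the neighbor count. */
    static boolean isCore(double[][] data, int i, double epsKm, double minWeight) {
        return neighborhoodWeight(data, i, epsKm) >= minWeight;
    }

    public static void main(String[] args) {
        // Three rows of a double[rows][3] matrix: longitude, latitude, weight.
        double[][] data = {
            {11.575, 48.137, 3.0},
            {11.580, 48.140, 2.5},
            {13.405, 52.520, 1.0},
        };
        for (int i = 0; i < data.length; i++) {
            System.out.println("row " + i + " core? " + isCore(data, i, 1.0, 5.0));
        }
    }
}
```

The linear scan over all rows is only for readability. In ELKI, the neighborhood search and the cluster expansion would be handled by the framework, and only the core condition would need a custom implementation.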