
Can ELKI cluster non-normalized negative points?


I have gone through the question "ELKI Kmeans clustering Task failed error for high dimensional data", but its solution doesn't help.

This is my first time with ELKI, so please bear with me. I have 45000 2D data points (obtained after running doc2vec) that contain negative values and are not normalized. The dataset looks something like this:

-4.688612   32.793335
-42.990147  -20.499323
-24.948868  -10.822767
-45.502155  -40.917801
27.979715   -40.012688
1.867812    -9.838544
56.284512   6.756072

I am using the K-means algorithm to get 2 clusters. However, I get the following error:

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=0,maxdim=1 LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
    at [...]

So my question is: does ELKI require the data to be in the range [0,1]? All the examples that I came across had their data within that range.

Or is it that ELKI does not accept negative values?

If something else, can someone please guide me through this?

Thank you!


Solution

  • ELKI can handle negative values just fine.

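    Your input data is not correctly formatted. This is the same problem as in "ELKI Kmeans clustering Task failed error for high dimensional data".

    According to the "Available types" line of the error (DoubleVector,variable,mindim=0,maxdim=1), every line of your file was parsed into either 0 or 1 numeric values instead of 2. ELKI itself can load such data, but k-means requires the data to be in an R^d vector space, i.e. every point must have the same number of dimensions, so ELKI cannot run k-means on this data set. The root cause is the input file, not negative or non-normalized values. Double-check your file: there is probably at least one line that is not properly formatted, and since no line was read as two numbers, the column separator is worth checking too.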
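
To locate the offending lines before loading the file into ELKI, a small check outside ELKI can help. The following is a minimal Python sketch, not part of ELKI: the function name `find_bad_lines` is hypothetical, and it assumes columns are separated by whitespace, commas, or semicolons, which may not match how your file or your ELKI parser is configured. It counts the numeric values on each line and reports every line that does not contain exactly the expected number.

```python
import re

# Assumption: columns are separated by whitespace, commas, or semicolons.
SEPARATOR = re.compile(r"[\s,;]+")


def find_bad_lines(lines, expected_dim=2):
    """Return (line_number, text, numeric_count) for each line that does
    not consist of exactly `expected_dim` numeric values."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        stripped = raw.strip()
        tokens = [t for t in SEPARATOR.split(stripped) if t]
        values = []
        for token in tokens:
            try:
                values.append(float(token))
            except ValueError:
                pass  # non-numeric token, e.g. a label or stray text
        # Flag lines with the wrong number of values or with extra tokens.
        if len(values) != expected_dim or len(values) != len(tokens):
            bad.append((number, stripped, len(values)))
    return bad


if __name__ == "__main__":
    # In-memory sample: two rows from the question plus deliberately broken rows.
    sample = [
        "-4.688612   32.793335",
        "-42.990147",
        "",
        "1.867812,-9.838544",
        "abc 1.0",
    ]
    for number, text, count in find_bad_lines(sample):
        print(f"line {number}: {count} numeric value(s): {text!r}")
```

To check a real file, pass an open file object instead of the sample list, for example `find_bad_lines(open("data.txt"))`, where `data.txt` is a placeholder name. Blank lines are reported as well, because the check only counts numeric values per line.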