Search code examples
sparse-matrixoutlierselki

How to work with sparse data using ELKI?


i'm trying to use a sparse matrix as input data in ELKI SOD algorithm to detect outliers. I was looking for help in howto and faqs page about sparse data, so i've tried to use SparseNumberVectorLabelParser and SparseVectorFieldFilter like this:

//data is a mxn matrix    
ArrayAdapterDatabaseConnection dataArray = new ArrayAdapterDatabaseConnection(data);
SparseDoubleVector.Factory sparseVector = new SparseDoubleVector.Factory();
SparseNumberVectorLabelParser<SparseDoubleVector> parser = new SparseNumberVectorLabelParser<SparseDoubleVector>(Pattern.compile("s*[,;s]s*")," \" ",Pattern.compile("^s*(#|//|;).*$"),null, sparseVector);
SparseVectorFieldFilter<SparseDoubleVector>  sparseFilter = new SparseVectorFieldFilter<SparseDoubleVector>();

ListParameterization params = new ListParameterization();
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dataArray);
params.addParameter(AbstractDatabaseConnection.Parameterizer.PARSER_ID, parser);
params.addParameter(AbstractDatabaseConnection.Parameterizer.FILTERS_ID, sparseFilter);
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();

params = new ListParameterization();
params.addParameter(SOD.Parameterizer.KNN_ID, 25);
params.addParameter(SharedNearestNeighborPreprocessor.Factory.NUMBER_OF_NEIGHBORS_ID, 10);
SOD<DoubleVector> sodAlg = ClassGenericsUtil.parameterizeOrAbort(SOD.class, params);
OutlierResult result = sodAlg.run(db); 

But i've got this runtime exception:

Exception in thread "main" de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field
Available types: DBID DoubleVector,field,mindim=7606,maxdim=12968
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:154)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:80)

Is this the right way to use SparseNumberVectorLabelParser and SparseVectorFieldFilter within java code?


Solution

  • ArrayAdapterDatabaseConnection is designed for dense data. For sparse data, it does not make much sense to first encode it into a dense array, then re-encode it into sparse vectors. Consider reading the data as sparse vectors directly to avoid overhead.

    The error you are seeing has a different reason, though: SOD is specified on a vector field of fixed dimensionality, but the sparse vectors yield a relation that has a variable dimensionality. So it doesn't find the requested data type (hence, NoSupportedDataTypeException).

    You can force the data to be of fixed dimensionality using SparseVectorFieldFilter.

    But I'm not sure if SOD is an appropriate algorithm to use on sparse data. Even though it should work then, the runtime and performance may be bad; because the algorithm isn't operating on data that satisfies the assumptions it was designed for. Sparse data is usually best handled with algorithms that exploit and handle data sparsity. (As is, you would also compute the shared nearest neighbors using Euclidean distance, which may not work well for sparse data. If the SNN are bad, SOD won't work well either)