Tags: data-mining, outliers, elki

The structure of the input file to ELKI when running LOF


I want to run the LOF algorithm with ELKI's GUI, but I can't figure out what sort of input file it expects. I have looked over here, and I have tried giving it an input CSV file with space-separated attribute values for each instance (the last attribute is the categorical class label; the rest are numeric). The file is similar to this (with no header):

5   1   1   1   2   1   3   1   1   benign
5   4   4   5   7   10  3   2   1   benign
3   1   1   1   2   2   3   1   1   benign
6   8   8   1   3   4   3   7   1   benign
4   1   1   3   2   1   3   1   1   benign
8   10  10  8   7   10  9   7   1   malignant
1   1   1   1   2   10  3   1   1   benign
2   1   2   1   2   1   3   1   1   benign
2   1   1   1   2   1   1   1   5   benign
4   2   1   1   2   1   2   1   1   benign
1   1   1   1   1   1   3   1   1   benign
2   1   1   1   2   1   2   1   1   benign
5   3   3   3   2   3   4   4   1   malignant

I set dbc.in to the CSV file, dbc.parser to NumberVectorLabelParser, ClassLabelFilter's index to 9 (since that's the index of the column with the class label), and k = 11.

However, it gives me this error:

Task failed
de.lmu.ifi.dbs.elki.utilities.exceptions.AbortException: Cannot initialize class labels: 9
    at de.lmu.ifi.dbs.elki.datasource.filter.typeconversions.ClassLabelFilter.filter(ClassLabelFilter.java:106)
    at de.lmu.ifi.dbs.elki.datasource.AbstractDatabaseConnection.invokeStreamFilters(AbstractDatabaseConnection.java:114)
    at de.lmu.ifi.dbs.elki.datasource.InputStreamDatabaseConnection.loadData(InputStreamDatabaseConnection.java:91)
    at de.lmu.ifi.dbs.elki.database.StaticArrayDatabase.initialize(StaticArrayDatabase.java:119)
    at de.lmu.ifi.dbs.elki.workflow.InputStep.getDatabase(InputStep.java:62)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:108)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:60)
    at [...]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 9
    at de.lmu.ifi.dbs.elki.data.LabelList.get(LabelList.java:109)
    at de.lmu.ifi.dbs.elki.datasource.filter.typeconversions.ClassLabelFilter.filter(ClassLabelFilter.java:102)
    at de.lmu.ifi.dbs.elki.datasource.AbstractDatabaseConnection.invokeStreamFilters(AbstractDatabaseConnection.java:114)
    at de.lmu.ifi.dbs.elki.datasource.InputStreamDatabaseConnection.loadData(InputStreamDatabaseConnection.java:91)
    at de.lmu.ifi.dbs.elki.database.StaticArrayDatabase.initialize(StaticArrayDatabase.java:119)
    at de.lmu.ifi.dbs.elki.workflow.InputStep.getDatabase(InputStep.java:62)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:108)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:60)
    at [...]

If I don't use the ClassLabelFilter, a popup dialog box appears with this message:

SVG Error:

null:-1
The attribute "d" of the element <path> is invalid

Could anyone please help me run the algorithm? Would appreciate the help very much, thanks!


Solution

  • Update: missing values in the input data broke the histogram visualizer. This has been fixed and will work in the next release (missing values will simply be ignored, though: there won't be a separate histogram bar to indicate the number of missing values. Contributions are welcome!)

    The class label index is relative to the labels, not to the CSV file: the filter only sees the label columns, nothing else. So you want to use index 0 for the class label.

    But you shouldn't need to use this filter at all.
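    For reference, a minimal sketch of the matching command-line parameters (the MiniGUI accepts the same parameter names; the jar name, the input file name, and the -classLabelIndex option name are assumptions from memory, so verify them against your ELKI version):

        # LOF without any class label filter; non-numeric columns are simply kept as labels:
        java -jar elki.jar KDDCLIApplication \
          -dbc.in breast-cancer.csv \
          -dbc.parser NumberVectorLabelParser \
          -algorithm outlier.lof.LOF \
          -lof.k 11

        # if you do want typed class labels, the index counts among the labels only, hence 0:
        # -dbc.filter ClassLabelFilter -classLabelIndex 0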

    The SVG visualization error most likely arises because an ∞ (infinity, encoded as UTF-8), or perhaps a NaN, ends up somewhere in the generated SVG. This could be due either to NaN values in the input data, or to duplicate records in the data.

    In this data set, there are 27 copies of the record

    1,1,1,1,2,1,1,1,1
    

    The way LOF is defined, if you have k or more duplicates of a point, the LOF score can become infinite. At that point, some visualization module sets a radius or a scale to infinity, and the result is no longer valid SVG (I'll open a bug ticket for that)! Welcome to the real world of messy data. ;-)
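    Concretely, using the standard LOF definitions from Breunig et al. (a sketch of the formulas, not ELKI's exact code), with reach-dist_k(p, o) = max{k-distance(o), d(p, o)}:

        \text{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}

        \text{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}

    With 27 copies of the same record and k = 11, each copy's k nearest neighbors are other copies at distance 0, and those neighbors' k-distances are 0 as well, so every reach-dist is 0 and lrd becomes 1/0 = ∞. Any non-duplicate point with such a copy among its neighbors then gets an infinite lrd in its LOF numerator, and the infinity propagates into the visualization.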

    Workaround 1: choose k = 30 or larger.

    Workaround 2: don't use the visualization; write the results to a file instead, or enable only the visualizers you need, e.g. -vis.enable scatter|unproj (see the sketch after this list).

    Workaround 3: remove duplicates first.

    Workaround 4: remove rows with missing values first.
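    For workaround 2, a sketch of the relevant output parameters (the output directory name is made up; -resulthandler ResultWriter, -out, and -vis.enable exist in current ELKI versions, but double-check the spelling for yours):

        # write the LOF scores to text files instead of rendering SVG:
        java -jar elki.jar KDDCLIApplication \
          -dbc.in breast-cancer.csv \
          -algorithm outlier.lof.LOF -lof.k 11 \
          -resulthandler ResultWriter -out lof-output/

        # or keep the GUI but restrict the visualizers, e.g.:
        # -vis.enable scatter|unproj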

    None of these changes makes the data set a good choice for outlier detection, however: the "outliers" in this data set form a cluster of their own.