ELKI: How to Specify Feature Columns of CSV for K-Means


I am trying to run K-Means using ELKI MiniGUI. I have a CSV dataset of 15 features (columns) and a label column. I would like to do multiple runs of K-Means with different combinations of the feature columns.

Is there anywhere in the MiniGUI where I can specify the indices of the columns I would like to be used for clustering?

If not, what is the simplest way to achieve this by changing/extending ELKI in Java?


Solution

  • This is obviously easy to achieve with Java code, or simply by preprocessing the data as necessary: generate 10 variants, then launch ELKI via the command line.

    But there is also a filter to select columns: NumberVectorFeatureSelectionFilter. To use only columns 0, 1, and 2 (indices refer to the numeric part only; labels are handled separately at this point, since this is a vector transformation):

    -dbc.filter transform.NumberVectorFeatureSelectionFilter
    -projectionfilter.selectedattributes 0,1,2
    

    The filter could be extended using our newer IntRangeParameter to allow for specifications such as 1..3,5..8; but this has not been implemented yet.
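
For the preprocessing route mentioned at the start of this answer, a minimal sketch in plain Java might look like the following. It does not use any ELKI API. The class name `CsvColumnSelector`, the method `select`, and the argument layout of `main` are hypothetical names chosen for illustration, and the code assumes a simple comma-separated file without quoted fields or a header row.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of ELKI): writes a CSV variant with only the chosen columns. */
public class CsvColumnSelector {

    /**
     * Returns one reduced row per non-empty input row: the selected feature
     * columns in the given order, followed by the label column.
     * Assumes plain comma-separated values without quoted fields.
     */
    public static List<String> select(List<String> rows, int[] featureCols, int labelCol) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            if (row.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = row.split(",", -1); // -1 keeps trailing empty cells
            StringBuilder sb = new StringBuilder();
            for (int c : featureCols) {
                sb.append(cells[c].trim()).append(',');
            }
            sb.append(cells[labelCol].trim());
            out.add(sb.toString());
        }
        return out;
    }

    /** Usage (assumed layout): CsvColumnSelector input.csv output.csv labelCol 0,1,2 */
    public static void main(String[] args) throws IOException {
        List<String> rows = Files.readAllLines(Paths.get(args[0]));
        int labelCol = Integer.parseInt(args[2]);
        String[] parts = args[3].split(",");
        int[] featureCols = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            featureCols[i] = Integer.parseInt(parts[i].trim());
        }
        Files.write(Paths.get(args[1]), select(rows, featureCols, labelCol));
    }
}
```

Running this once per feature combination produces one file per variant; each file can then be passed to ELKI as the input file (`-dbc.in`) in a separate run. Note that, unlike the filter above, the column indices here count every column in the file, including the label.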