r, cluster-analysis, dbscan, mlr

Tuning without resampling in mlr package (clustering)


In the mlr package, I can perform clustering. Let's say I don't want to know how the model performs on unseen data; I just want to know what the best number of clusters is with respect to a given performance measure.

In this example, I use the moons data set of the dbscan package.

library(mlr)
library(dbscan)
data("moons")

db_task = makeClusterTask(data = moons)

db = makeLearner("cluster.dbscan")

ps = makeParamSet(makeDiscreteParam("eps", values = seq(0.1, 1, by = 0.1)),
  makeIntegerParam("MinPts", lower = 1, upper = 5))

ctrl = makeTuneControlGrid()

rdesc = makeResampleDesc("CV", iters = 3) # I don't want to use it, but I have to

res = tuneParams(db, 
  task = db_task, 
  control = ctrl,
  measures = silhouette, 
  resampling = rdesc, 
  par.set = ps)
#> [Tune] Started tuning learner cluster.dbscan for parameter set:
#>            Type len Def                                Constr Req Tunable
#> eps    discrete   -   - 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1   -    TRUE
#> MinPts  integer   -   -                                1 to 5   -    TRUE
#>        Trafo
#> eps        -
#> MinPts     -
#> With control class: TuneControlGrid
#> Imputation value: Inf
#> [Tune-x] 1: eps=0.1; MinPts=1
#> Error in matrix(nrow = k, ncol = ncol(x)): invalid 'nrow' value (too large or NA)

Created on 2019-06-06 by the reprex package (v0.3.0)

However, mlr forces me to use a resampling strategy. Any idea of how to use mlr in cluster tasks without resampling?


Solution

  • mlr is pretty poor when it comes to clustering. Its dbscan learner is a wrapper around the very slow fpc package. Others wrap Weka, which is also very slow.

    Use the dbscan package instead.

    However, parameter tuning doesn't simply work in unsupervised settings. You don't have labels, so you only have unreliable "internal" heuristics instead. And most of these are not reliable for DBSCAN because they treat noise as a cluster, which it isn't. Few tools support noise in evaluation (I've seen options for this in ELKI), and I'm not convinced that either of the variants for handling noise is good. You can construct undesirable cases for each variant IMHO. You probably need to use at least two measures when evaluating a clustering with noise.
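As a rough illustration of the points above, here is a minimal sketch that calls the dbscan package directly on the full data (no resampling) and reports two measures per setting: the mean silhouette width over non-noise points and the fraction of points labelled as noise. The helper `score_dbscan` is a hypothetical name, and computing the silhouette only on non-noise points is just one of the debatable ways to handle noise, so treat the ranking as a heuristic rather than a definitive answer.

```r
library(dbscan)   # dbscan() and the moons data set
library(cluster)  # silhouette()

data("moons")
x  <- as.matrix(moons)
dm <- as.matrix(dist(x))  # pairwise Euclidean distances, computed once

# Hypothetical helper: score one (eps, minPts) setting on the full data.
score_dbscan <- function(x, dm, eps, minPts) {
  cl   <- dbscan::dbscan(x, eps = eps, minPts = minPts)$cluster
  keep <- cl != 0                       # dbscan labels noise points as 0
  k    <- length(unique(cl[keep]))      # number of clusters, noise excluded
  sil  <- NA_real_
  # silhouette() needs at least 2 clusters and fewer clusters than points;
  # it is computed on the non-noise points only (an assumption, see above).
  if (k >= 2 && k < sum(keep)) {
    s   <- silhouette(cl[keep], dmatrix = dm[keep, keep])
    sil <- mean(s[, "sil_width"])
  }
  data.frame(eps = eps, minPts = minPts, clusters = k,
             noise = mean(!keep), silhouette = sil)
}

# Same grid as in the question.
grid <- expand.grid(eps = seq(0.1, 1, by = 0.1), minPts = 1:5)
res  <- do.call(rbind, Map(function(e, m) score_dbscan(x, dm, e, m),
                           grid$eps, grid$minPts))

# Highest silhouette first, ties broken by less noise; NA scores sort last.
head(res[order(-res$silhouette, res$noise), ])
```

Looking at `silhouette` and `noise` side by side matters here: a setting can reach a high silhouette simply by declaring most points noise, which the second column makes visible.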