In the mlr package, I can perform clustering. Let's say I don't want to know how the model performs on unseen data; I just want to know what the best number of clusters is with respect to a given performance measure.
In this example, I use the moons data set of the dbscan package.
library(mlr)
library(dbscan)
data("moons")
db_task = makeClusterTask(data = moons)
db = makeLearner("cluster.dbscan")
ps = makeParamSet(makeDiscreteParam("eps", values = seq(0.1, 1, by = 0.1)),
                  makeIntegerParam("MinPts", lower = 1, upper = 5))
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("CV", iters = 3) # I don't want to use it, but I have to
res = tuneParams(db,
                 task = db_task,
                 control = ctrl,
                 measures = silhouette,
                 resampling = rdesc,
                 par.set = ps)
#> [Tune] Started tuning learner cluster.dbscan for parameter set:
#>            Type len Def                                 Constr Req Tunable
#> eps    discrete   -   - 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1   -    TRUE
#> MinPts  integer   -   -                                 1 to 5   -    TRUE
#>        Trafo
#> eps        -
#> MinPts     -
#> With control class: TuneControlGrid
#> Imputation value: Inf
#> [Tune-x] 1: eps=0.1; MinPts=1
#> Error in matrix(nrow = k, ncol = ncol(x)): invalid 'nrow' value (too large or NA)
Created on 2019-06-06 by the reprex package (v0.3.0)
However, mlr forces me to use a resampling strategy. Any idea how to use mlr for cluster tasks without resampling?
mlr is pretty poor when it comes to clustering. Its dbscan function is a wrapper around the very slow fpc package. Others wrap Weka, which is also very slow.
Use the dbscan package instead.
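As a minimal sketch of what that looks like, the dbscan package can be run once on the full data set, with no resampling step at all. The eps and minPts values below are arbitrary illustrative choices, not tuned ones, and note that the argument is spelled minPts here rather than MinPts.

```r
library(dbscan)

data("moons")
x <- as.matrix(moons)

# Fit DBSCAN once on the full data; no train/test split is involved.
# eps and minPts are illustrative assumptions, not recommended values.
fit <- dbscan(x, eps = 0.3, minPts = 5)

# One label per point; 0 marks noise, not a cluster.
labels <- fit$cluster
table(labels)
```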
However, parameter tuning doesn't "just work" in unsupervised settings. You don't have labels, so you only have unreliable "internal" heuristics. Most of these are not reliable for DBSCAN because they treat noise as a cluster, which it isn't. Few tools support noise in evaluation (I've seen options for this in ELKI), and I'm not convinced that any of the variants for handling noise is good; you can construct undesirable cases for each variant, IMHO. You probably need at least two measures when evaluating a clustering with noise.
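To illustrate the "at least two measures" idea, here is a rough sketch of a plain grid search on the full data that reports the mean silhouette over non-noise points together with the fraction of points labelled as noise. Dropping noise before computing the silhouette is just one of the variants mentioned above and has its own failure mode (it rewards declaring every difficult point noise), which is why the noise fraction is shown next to it. The helper score_dbscan is a hypothetical name introduced for this example, and the grid mirrors the one in the question.

```r
library(dbscan)
library(cluster)

data("moons")
x <- as.matrix(moons)

# Hypothetical helper: score one DBSCAN run with two measures.
score_dbscan <- function(x, eps, minPts) {
  labels <- dbscan(x, eps = eps, minPts = minPts)$cluster
  keep <- labels != 0                 # 0 = noise in the dbscan package
  k <- length(unique(labels[keep]))   # number of real clusters
  sil <- NA_real_
  # The silhouette is only defined for 2 <= k <= n - 1.
  if (k >= 2 && k < sum(keep)) {
    s <- silhouette(labels[keep], dist(x[keep, , drop = FALSE]))
    sil <- mean(s[, "sil_width"])
  }
  data.frame(eps = eps, minPts = minPts,
             silhouette = sil, noise_fraction = mean(!keep))
}

# Same grid as in the question, evaluated without any resampling.
grid <- expand.grid(eps = seq(0.1, 1, by = 0.1), minPts = 1:5)
results <- do.call(rbind, Map(function(e, m) score_dbscan(x, e, m),
                              grid$eps, grid$minPts))

# Sort by silhouette (descending), then by noise fraction (ascending).
results[order(-results$silhouette, results$noise_fraction), ]
```

A setting with a high silhouette but a large noise fraction should be treated with suspicion; looking at both columns side by side is the point, not picking the top row blindly.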