Search code examples
pythonpython-2.7scikit-learncluster-analysisdbscan

Why is DBSCAN.fit() faster with more features?


I'm playing around with DBSCAN. I'm wondering why the execution time decreases as I increase the number of features (see plot below). I would expect execution time to increase as the number of features increases...

import timeit
import functools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.cluster import DBSCAN

features = [2, 4, 8, 10]
training_examples = [100, 500, 1000,2000]
n_iterations = 10

x = np.asarray(training_examples)

for num_features in features:

    average_execution_time = []

    for num_training_examples in training_examples:
        # generate matrix of random training examples
        X = np.random.rand(num_training_examples, num_features)

        # generate a symmetric distance matrix
        D = euclidean_distances(X, X)

        # DBSCAN parameters
        eps = 0.5
        kmedian_thresh = 0.005
        min_samples = 5

        db = DBSCAN(eps=eps,
                    min_samples=min_samples,
                    metric='precomputed')

        # Call timeit
        t = timeit.Timer(functools.partial(db.fit, D))

        average_execution_time.append(t.timeit(n_iterations) / n_iterations)

    y = np.asarray(average_execution_time)

    plt.plot(x, y, label='{} features'.format(num_features))

plt.xlabel('No. of Training Examples')
plt.ylabel('DBSCAN.fit() time to Cluster')
plt.title('DBSCAN.fit() avg time to Cluster')
plt.legend()
plt.grid()
plt.show()

enter image description here


Solution

  • The DBSCAN algorithm basically requires 2 parameters:

    eps: specifies how close points should be to each other to be considered a part of a cluster. It means that if the distance between two points is lower or equal to this value (eps), these points are considered neighbors.
    
    minPoints: the minimum number of points to form a dense region. For example, if we set the minPoints parameter as 5, then we need at least 5 points to form a dense region.
    

    I think your question is related to both types of paramerers.

    eps: if the eps value chosen is too small, a large part of the data will not be clustered. It will be considered outliers because don’t satisfy the number of points to create a dense region. On the other hand, if the value that was chosen is too high, clusters will merge and the majority of objects will be in the same cluster. The eps should be chosen based on the distance of the dataset (we can use a k-distance graph to find it), but in general small eps values are preferable. Basically, bigger = faster.

    minPoints: As a general rule, a minimum minPoints can be derived from a number of dimensions (D) in the data set, as minPoints ≥ D + 1. Larger values are usually better for data sets with noise and will form more significant clusters. The minimum value for the minPoints must be 3, but the larger the data set, the larger the minPoints value that should be chosen. Basically, bigger = faster.