DBSCAN Silhouette Coefficients: does this for-loop work?


I'm trying to compare the results of my classmate's Silhouette Score calculations to mine, and I'm having some trouble wrapping my head around their for-loop. I'm not looking for freebies; we've already submitted the code below for grading. I'm just trying to understand what's going on here for future reference.

The question:

Using DBSCAN, iterate (for-loop) through different values of min_samples (1 to 10) and epsilon (.05 to .5, in steps of .01) to find clusters in the road-data used in the Lesson, and calculate the Silhouette Coeff for min_samples and epsilon.

road-data:

             osm         lat          lon         alt
0      144552912    9.349849    56.740876   17.052772
1      144552912    9.350188    56.740679   17.614840
2      144552912    9.350549    56.740544   18.083536
...
434873  93323209    9.943451    57.496270   24.635285

434874 rows × 4 columns

(Updated Edit) Normalized:

#Normalize sample from dataset
XX = X.copy()
XX['alt'] = (X.alt - X.alt.mean())/X.alt.std()
XX['lat'] = (X.lat - X.lat.mean())/X.lat.std()
XX['lon'] = (X.lon - X.lon.mean())/X.lon.std()

Classmate's loop:

start   = 0.0
stop    = 0.45
step    = 0.01
my_list = np.arange(start, stop+step, step)

startb   = 1
stopb    = 10
stepb    = .2 # To scale proportionately with epsilon increments
my_listb = np.arange(startb, stopb+stepb, stepb)

my_range = range(45)

one = []

for i in tqdm(my_range):
   dbscan = DBSCAN(eps = .05 + my_list[i] , min_samples = 1 + my_listb[i])
   XX.cluster = dbscan.fit_predict(XX[['lat','lon']])
   one.append(metrics.silhouette_score(XX[['lat', 'lon']], XX.cluster))

Classmate's figure

My Loop(s):

(I broke my solution up into 10 loops, one for each min_samples value (1-10). Examples below.)

#eps loop 0.05 to 0.5 (steps 0.01) min_samples=1

eps_range = [x / 100.0 for x in range(5,51,1)]
eps_scores_1 = []
for e in tqdm(eps_range):
    dbscan = DBSCAN(eps=e, min_samples=1)
    labels = dbscan.fit_predict(XX[['lon', 'lat', 'alt']])
    eps_scores_1.append(metrics.silhouette_score(XX[['lon', 'lat', 'alt']], labels))

-

#eps loop 0.05 to 0.5 (steps 0.01) min_samples=2

eps_range = [x / 100.0 for x in range(5,51,1)]
eps_scores_2 = []
for e in tqdm(eps_range):
    dbscan = DBSCAN(eps=e, min_samples=2)
    labels = dbscan.fit_predict(XX[['lon', 'lat', 'alt']])
    eps_scores_2.append(metrics.silhouette_score(XX[['lon', 'lat', 'alt']], labels))

My Figure

What I observe, as far as differences:

  1. Classmate did not include 'alt' in their for-loop.
  2. Classmate attempted some kind of nested loop?
  3. Classmate's range is 45, not sure that's right.
  4. Classmate's my_list is not in the correct notation?
  5. Classmate's max Silhouette Scores are much higher than mine.
  6. (not shown) Classmate used 10,000 random samples, I used 30,000 random samples.

Solution

  • The question asks for both min_samples and epsilon to be varied, so it calls for a nested loop. Your classmate used a single loop and did not consider combinations. You did the outer loop by copy and paste.

  • Your classmate manages the ranges in a very misleading way, because 0.05 and 1 are added to the list values later, inside the loop. With range(45), eps actually runs from 0.05 to 0.49 and min_samples from 2.0 to 10.8 in non-integer steps of 0.2, and the two only ever change together.

  • You cannot just mix latitude, longitude, and altitude. They have different units. In fact, you shouldn't even mix latitude and longitude because of distortion; use Haversine distance instead!

  • Silhouette assumes convex clusters, but DBSCAN does not generate convex clusters.

  • The sklearn implementation likely treats noise (label -1) just like a cluster, which will usually give worse results. But Silhouette is not really meant to be used with noise labels...
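
The nested loop that the question calls for could be sketched roughly as follows. This is only a sketch under some assumptions: silhouette_grid is a hypothetical helper name, noise points (label -1) are dropped before scoring, which is one possible way to handle them and not the only one, and combinations where the score is undefined (fewer than two clusters, or every point in its own cluster) are recorded as None instead of raising an error.

```python
import itertools

# Parameter ranges taken from the assignment text.
eps_values = [x / 100.0 for x in range(5, 51)]   # 0.05, 0.06, ..., 0.50
min_samples_values = list(range(1, 11))          # 1, 2, ..., 10

# Every (min_samples, eps) combination: 10 * 46 = 460 pairs.
grid = list(itertools.product(min_samples_values, eps_values))


def silhouette_grid(X, grid):
    """Return {(min_samples, eps): silhouette or None} for each grid pair.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    # Imported here so the grid above can be inspected without scikit-learn.
    import numpy as np
    from sklearn import metrics
    from sklearn.cluster import DBSCAN

    X = np.asarray(X)
    scores = {}
    for min_samples, eps in grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        clustered = labels != -1                  # assumption: ignore noise
        n_clusters = len(set(labels[clustered]))
        n_points = int(clustered.sum())
        # silhouette_score needs 2 <= n_clusters <= n_points - 1
        if 2 <= n_clusters <= n_points - 1:
            scores[(min_samples, eps)] = metrics.silhouette_score(
                X[clustered], labels[clustered])
        else:
            scores[(min_samples, eps)] = None     # score undefined here
    return scores
```

With the normalized sample from the question this might be called as scores = silhouette_grid(XX[['lat', 'lon']], grid). Keep in mind that dropping the noise points tends to make the scores look better without telling you how many points were discarded, so it is worth tracking the noise fraction alongside the score.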