Tags: matplotlib, histogram, cluster-analysis, data-mining, dbscan

Plotting a histogram of a 2D NumPy array of (latitude, longitude) to determine proper parameter values for DBSCAN


I am trying to apply DBSCAN to a dataset of (Lon, Lat) points. The algorithm is very sensitive to its parameters, Eps and MinPts.

I would like to look at a histogram of the data to determine proper values. Unfortunately, Matplotlib's hist() expects a 1D array.

When a 2D array is passed as the argument, hist() treats each column as a separate dataset.

Scatter plot and histograms:

[Figure: scatter plot of the data]

[Figure: histogram]

Does anyone have a way to solve this?


Solution

  • If you follow the DBSCAN article, you only need the distance from each object to its 4th nearest neighbor, not all pairwise distances. That is, a one-dimensional array.

    Instead of plotting a histogram, the authors sort these values and choose a knee in the resulting plot.

    1. find the 4th nearest neighbor of each object
    2. collect all 4-NN distances in one array
    3. sort this array in descending order
    4. plot the resulting curve
    5. look for a knee, often best at around 5%-10% of your x axis (so that 90%-95% of objects are core points); the distance at the knee is a candidate for Eps, with MinPts = 4.

    For details, see the original DBSCAN publication!
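
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the DBSCAN paper: the function names `k_distances` and `plot_k_distances` are made up for this example, the search is brute force (fine for a few thousand points), and it assumes plain Euclidean distance on the raw coordinates. For (Lon, Lat) data spread over a large area, a great-circle (haversine) distance would be more appropriate.

```python
import math


def k_distances(points, k=4):
    """Return each point's distance to its k-th nearest neighbor, sorted descending.

    Hypothetical helper for illustration. Brute force, O(n^2).
    Assumes Euclidean distance on the raw (x, y) coordinates.
    """
    pts = [(float(x), float(y)) for x, y in points]
    if len(pts) <= k:
        raise ValueError("need more than k points")
    result = []
    for i, (xi, yi) in enumerate(pts):
        # Distances from point i to every other point (the point itself is skipped).
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(pts)
            if j != i
        )
        result.append(dists[k - 1])  # distance to the k-th nearest neighbor
    # Step 3: sort in descending order.
    return sorted(result, reverse=True)


def plot_k_distances(kdist):
    """Plot the sorted k-distance curve; look for the knee by eye."""
    import matplotlib.pyplot as plt  # imported here so k_distances has no dependencies

    plt.plot(range(len(kdist)), kdist)
    plt.xlabel("objects, sorted by k-distance (descending)")
    plt.ylabel("distance to k-th nearest neighbor")
    plt.show()
```

With a 2D NumPy array `data` of shape (n, 2), something like `plot_k_distances(k_distances(data, k=4))` should give the curve described in steps 1 to 4. The knee still has to be chosen by inspection. For larger datasets, an indexed nearest-neighbor search (for example scikit-learn's `NearestNeighbors`) would likely be a better choice than this quadratic loop.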