python, parameters, dbscan

DBSCAN in Python - Running out of memory


My data has 1 million latitude/longitude coordinate pairs. I am using the DBSCAN algorithm with the haversine distance measure. So far it only runs on a subset of 8,000 records; when I try to run it on the entire dataset, it runs out of memory within seconds. Can someone help with this?


Solution

  • Usually, you would apply epsilon to the distance between the points, i.e. the distance computed from their latitude and longitude.

    But then count is not used at all.

    Please read up on generalized DBSCAN for the customizations needed to apply DBSCAN to such data. Neither regular DBSCAN nor any other clustering algorithm will run out of the box on your data. You may also want to look into spatial autocorrelation.
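
To make "epsilon on the distance between the points" concrete, here is a minimal sketch of an epsilon-neighborhood query using the haversine distance. It only illustrates the idea and is not a full DBSCAN. The names `haversine_km` and `eps_neighbors`, the Earth radius constant, and the sample coordinates are assumptions made for this example, not part of the original post. Distances are computed one query point at a time, so no full pairwise distance matrix is ever held in memory.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against rounding error before taking asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def eps_neighbors(points, index, eps_km):
    """Indices of all points within eps_km of points[index], itself included."""
    center = points[index]
    return [i for i, p in enumerate(points) if haversine_km(center, p) <= eps_km]


# Hypothetical sample data: two points in Paris and one in London.
points = [
    (48.8566, 2.3522),   # Paris city centre
    (48.8606, 2.3376),   # Louvre, roughly 1 km away
    (51.5074, -0.1278),  # London
]

neighbors = eps_neighbors(points, 0, eps_km=5.0)
print(neighbors)
```

With a 5 km epsilon, the two Paris points are neighbors and London is not. If you use scikit-learn, `DBSCAN` accepts `metric='haversine'` together with `algorithm='ball_tree'`; in that case the coordinates must be passed as (latitude, longitude) in radians, and `eps` is likewise in radians (distance divided by the Earth radius). Whether that fits in memory for 1 million points still depends on how many neighbors each point has at the chosen `eps`.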