Tags: r, algorithm, cluster-analysis, hierarchical-clustering, unsupervised-learning

Best way to cluster long/lat hotspot points in one city in R?


I am new to R and to (unsupervised) machine learning. I'm trying to find the best clustering approach for my data in R.

What is my data about?

I have a dataset with roughly 800 long/lat WGS84 coordinates in one city.

Longitude is in the range 6.90 to 6.95; latitude is in the range 52.29 to 52.33.

What do I want?

I want to find "hotspots" based on point density. For example: at least 5 long/lat points within a range of 50 meters. This is a point plot example:

[Image: point plot]

Why do I want this?

For example: let's assume that every single point is a car accident. By clustering the points I hope to see which areas need attention (an area with at least x points within a range of y meters needs attention).
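The "at least x points within y meters" rule can be checked directly before picking a clustering algorithm. The following is a minimal base R sketch, not a definitive implementation: the helper names `to_local_m` and `count_neighbours` and the sample coordinates are made up for illustration, and the projection is a simple equirectangular approximation that should be adequate at city scale.

```r
# Hypothetical helper: project lon/lat (degrees) to local x/y in meters
# with an equirectangular approximation around the mean latitude.
to_local_m <- function(lon, lat) {
  earth_radius <- 6371000                     # mean Earth radius in meters
  lat0 <- mean(lat) * pi / 180
  cbind(
    x = earth_radius * (lon - mean(lon)) * pi / 180 * cos(lat0),
    y = earth_radius * (lat - mean(lat)) * pi / 180
  )
}

# Hypothetical helper: for every point, count the points (itself included)
# that lie within radius_m meters.
count_neighbours <- function(lon, lat, radius_m) {
  d <- as.matrix(dist(to_local_m(lon, lat)))
  rowSums(d <= radius_m)
}

# Made-up coordinates: five points a few meters apart plus two isolated points
lon <- c(6.9000, 6.9001, 6.9002, 6.9001, 6.9000, 6.9300, 6.9450)
lat <- c(52.3000, 52.3000, 52.3001, 52.3001, 52.3002, 52.3100, 52.3250)

n <- count_neighbours(lon, lat, radius_m = 50)
is_hotspot <- n >= 5   # TRUE where at least 5 points fall within 50 m
```

Points flagged this way correspond to what DBSCAN calls core points when `eps` is 50 and `minPts` is 5.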

What have I found?

The following clustering algorithms seem suitable for my problem:

  1. DBSCAN (https://cran.r-project.org/web/packages/dbscan/dbscan.pdf)
  2. HDBSCAN (https://cran.r-project.org/web/packages/dbscan/vignettes/hdbscan.html)
  3. OPTICS (https://www.rdocumentation.org/packages/dbscan/versions/0.9-8/topics/optics)
  4. City Clustering Algorithm (https://cran.r-project.org/web/packages/osc/vignettes/paper.pdf)

My questions

  1. What is the best solution or algorithm for my case in R?
  2. Is it true that I have to convert my long/lat to a distance / Haversine matrix first?
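On the second question: `hclust()` needs a dissimilarity object, and `dbscan::dbscan()` accepts either coordinates or a `dist` object, so a matrix of distances in meters is one way to make a threshold such as "50 meters" meaningful. Below is a sketch of a Haversine distance matrix in base R. The function name `haversine_matrix`, the spherical Earth radius, and the sample points are assumptions for illustration; the DBSCAN call only runs if the dbscan package happens to be installed.

```r
# Hypothetical helper: Haversine (great-circle) distance in meters
# between all pairs of points, assuming a spherical Earth.
haversine_matrix <- function(lon, lat, radius = 6371000) {
  phi <- lat * pi / 180
  lam <- lon * pi / 180
  a <- sin(outer(phi, phi, "-") / 2)^2 +
    outer(cos(phi), cos(phi)) * sin(outer(lam, lam, "-") / 2)^2
  d <- 2 * radius * asin(pmin(sqrt(a), 1))
  dim(d) <- dim(a)   # make sure the result stays an n x n matrix
  d
}

# Made-up points: five within a few meters of each other, one about 3.4 km away
lon <- c(6.9000, 6.9001, 6.9002, 6.9001, 6.9000, 6.9500)
lat <- c(52.2900, 52.2900, 52.2901, 52.2901, 52.2902, 52.2900)
d <- haversine_matrix(lon, lat)

# The matrix can be handed to hclust() after conversion with as.dist()
hc <- hclust(as.dist(d), method = "complete")

# DBSCAN on the same distances: at least 5 points within 50 m.
# In the result, cluster 0 marks noise points.
if (requireNamespace("dbscan", quietly = TRUE)) {
  db <- dbscan::dbscan(as.dist(d), eps = 50, minPts = 5)
  print(db$cluster)
}
```

A full distance matrix for about 800 points has only 640,000 entries, so memory should not be a concern at this size. For much larger data sets, projecting the coordinates to meters and clustering those directly avoids building the matrix.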

Solution

  • I found something interesting at https://gis.stackexchange.com/questions/64392/finding-clusters-of-points-based-distance-rule-using-r

    I changed this code a bit and treat the outliers (clusters with unusually many points) as places where a lot happens.

    # Packages used below: sp (SpatialPointsDataFrame, CRS) and geosphere (distm)
    library(sp)
    library(geosphere)

    # 1. Make a SpatialPointsDataFrame
    # (x and y are assumed to be the longitude and latitude vectors,
    #  i.e. input$long and input$lat)
    xy <- SpatialPointsDataFrame(
      matrix(c(x, y), ncol = 2), data.frame(ID = seq_along(x)),
      proj4string = CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))

    # 2. Use distm() to generate the distance matrix (in meters)
    mdist <- distm(xy)

    # 3. Use hierarchical clustering with the complete-linkage method
    hc <- hclust(as.dist(mdist), method = "complete")

    # 4. Show the dendrogram
    plot(hc, labels = input$street, xlab = "", sub = "", cex = 0.7)

    # 5. Set the distance: in my case 300 meters
    d <- 300

    # 6. Define clusters based on a tree "height" cutoff d and add them to the SpatialPointsDataFrame
    xy$clust <- cutree(hc, h = d)

    # 7. Add the clusters to the dataset
    input$cluster <- xy@data[["clust"]]

    # 8. Plot the clusters
    plot(input$long, input$lat, col = input$cluster, pch = 20)
    text(input$long, input$lat, labels = input$cluster)

    

    [Image: 300 m cluster]

    # 9. Count the number of points per cluster (needs dplyr)
    library(dplyr)
    selection2 <- input %>% count(cluster)

    # 10. Make a boxplot of the cluster sizes
    boxplot(selection2$n)

    # 11. Get the smallest outlier
    outlier <- boxplot.stats(selection2$n)$out
    outlier <- sort(outlier)
    outlier <- as.numeric(outlier[1])

    # 12. Keep the clusters that are at least as large as that outlier
    selectie3 <- selection2 %>% filter(n >= outlier) %>% select(cluster)

    # 13. Make a new data frame with all points in the outlier clusters
    heatclusters <- input %>% filter(cluster %in% selectie3$cluster)

    # 14. Plot the outlier clusters
    plot(heatclusters$long, heatclusters$lat, col = heatclusters$cluster)
    

    [Image: outlier cluster]

    # 15. Plot on a density map
    # (googlemap is assumed to be a base map plot created earlier, e.g. with ggmap)
    googlemap +
      geom_point(aes(x = long, y = lat), data = heatclusters,
                 color = "red", size = 0.1, shape = ".") +
      stat_density2d(data = heatclusters,
                     aes(x = long, y = lat, fill = ..level..), alpha = .2, size = 0.1,
                     bins = 10, geom = "polygon") +
      scale_fill_gradient(low = "green", high = "red")
    
    

    I don't know if this is a good solution, but it seems to work. Does anyone have other suggestions?
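One property of the complete-linkage step is easy to verify: cutting the tree at height d yields clusters in which no two points are more than d apart. This is a diameter limit, not a density rule like "at least 5 points within 50 meters". The sketch below checks it in base R on made-up planar coordinates in meters; the data, the seed, and the variable names are assumptions for illustration only.

```r
set.seed(42)

# Made-up planar coordinates in meters (200 random points in a 3 km square)
pts <- cbind(x = runif(200, 0, 3000), y = runif(200, 0, 3000))
d_cut <- 300

dm <- dist(pts)
hc <- hclust(dm, method = "complete")
cl <- cutree(hc, h = d_cut)

# Largest pairwise distance inside each cluster
dmat <- as.matrix(dm)
max_within <- sapply(split(seq_along(cl), cl),
                     function(idx) max(dmat[idx, idx]))

# With complete linkage, no cluster is wider than the cutoff
all(max_within <= d_cut)

# Cluster sizes, the quantity that the boxplot step looks at
cluster_sizes <- table(cl)
```

Because the cutoff only bounds the cluster diameter, the size of each cluster still has to be inspected afterwards, which is what the outlier step does.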