r, cluster-computing, centroid

Find the cluster centroid closest to a predicted coordinate and return the cluster of the closest centroid


I am predicting latitude and longitude coordinates. When I predict, for example, the latitude, I want to compare the prediction with another variable that holds the centroids of the clusters I built from the latitude and longitude. I want to return the cluster (stored in a third variable) whose centroid is closest to the predicted latitude. I based my setup on another Stack Overflow post, but I don't get the right cluster as a result. Can someone help me see what I did wrong?

I want the pred_cluster_test variable to contain the cluster (ClusterEnd) whose ClusterEnd_LatitudeCenter is closest to the predicted latitude (predictions_test):

df <- dfTraining %>%
  group_by(TripID) %>%
  mutate(pred_cluster_test = case_when(
    ClusterEnd_LatitudeCenter == predictions_test ~ ClusterEnd[ClusterEnd_LatitudeCenter],
    TRUE ~ ClusterEnd[sapply(ClusterEnd_LatitudeCenter,
                             function(x) which.min(x - predictions_test))]
  ))

This is what the data looks like:

structure(list(EndLatitude = c(38.26, 38.218, 38.255, 38.258, 
38.213, 38.215), EndLongitude = c(-85.75, -85.754, -85.746, -85.751, 
-85.751, -85.757), ClusterEnd = c(1, 4, 1, 5, 4, 4), ClusterEnd_LatitudeCenter = c(38.25629, 
38.21723, 38.25629, 38.25322, 38.21723, 38.21723), ClusterEnd_LongitudeCenter = c(-85.74133, 
-85.75955, -85.74133, -85.75783, -85.75955, -85.75955), predictions_test = c(`1` = 38.2407296518939, 
`2` = 38.2326115950784, `3` = 38.2428487622735, `4` = 38.2449069816005, 
`5` = 38.234314694847, `6` = 38.2347388488934), pred_cluster_test = c(38.25629, 
38.21723, 38.25629, 38.25322, 38.21723, 38.21723)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • Provided that I understand correctly what is expected, the following may work:

    library(dplyr)
    
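    # Return the index of the centroid closest to a single prediction x
    foo <- function(x, cluster_coords) {
      # One row per centroid: column 1 repeats x, column 2 holds the centroid
      mat <- cbind(x, cluster_coords)
      # dist() on a two-element row gives the absolute difference |x - centroid|
      distance <- apply(mat, MARGIN = 1, FUN = dist, method = "euclidean")
      which.min(distance)
    }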
    
    df %>% 
      mutate(
        cluster_pred_test = ClusterEnd[
        sapply(
          predictions_test,
          function(x) foo(x, ClusterEnd_LatitudeCenter)
          )
        ]
      ) %>%
      pull(cluster_pred_test)
    #> [1] 5 4 5 5 4 4
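
    For a single coordinate, the same lookup can be sketched in base R without dist(): take the absolute difference between each prediction and every centroid, then pick the smallest. The helper name nearest_cluster_1d is hypothetical and the vectors are copied from the question's example data. The abs() call matters here: which.min(x - predictions_test), as in the question's code, returns the position of the most negative difference rather than the smallest distance, which is one likely reason for the unexpected clusters.

    ```r
    # Hypothetical helper: cluster label of the latitude centroid nearest to each prediction
    nearest_cluster_1d <- function(pred, centers, clusters) {
      # For each prediction, index of the centroid with the smallest absolute difference
      idx <- vapply(pred, function(p) which.min(abs(centers - p)), integer(1))
      clusters[idx]
    }

    # Values taken from the question's example data
    ClusterEnd <- c(1, 4, 1, 5, 4, 4)
    ClusterEnd_LatitudeCenter <- c(38.25629, 38.21723, 38.25629,
                                   38.25322, 38.21723, 38.21723)
    predictions_test <- c(38.2407296518939, 38.2326115950784, 38.2428487622735,
                          38.2449069816005, 38.234314694847, 38.2347388488934)

    nearest_cluster_1d(predictions_test, ClusterEnd_LatitudeCenter, ClusterEnd)
    #> [1] 5 4 5 5 4 4
    ```

    Inside mutate(), this would be called on the corresponding columns; it is a sketch under the assumptions above, not a drop-in replacement for grouped data.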
    

    You may want to extend this to use both coordinates, and look into the dplyr::group_map and dplyr::group_modify functions, which may help with efficient grouped operations.
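
    Using both coordinates could look roughly like the sketch below. It assumes a table with one row per cluster centroid and a longitude prediction alongside the latitude one. The question's data has no longitude prediction, so pred_lat and pred_lon are made-up values, and nearest_cluster_2d and the column names cluster, lat and lon are hypothetical. Squared Euclidean distance on raw degrees is only a rough approximation over a small area; a great-circle distance may be more appropriate.

    ```r
    # Hypothetical helper: nearest centroid using latitude and longitude together
    nearest_cluster_2d <- function(pred_lat, pred_lon, centers) {
      # centers: data frame with one row per cluster (columns cluster, lat, lon)
      idx <- mapply(function(la, lo) {
        # Squared Euclidean distance in degrees to every centroid
        which.min((centers$lat - la)^2 + (centers$lon - lo)^2)
      }, pred_lat, pred_lon)
      centers$cluster[idx]
    }

    # Centroids taken from the question's example data (one row per cluster)
    centers <- data.frame(
      cluster = c(1, 4, 5),
      lat = c(38.25629, 38.21723, 38.25322),
      lon = c(-85.74133, -85.75955, -85.75783)
    )

    # Made-up predictions, for illustration only
    pred_lat <- c(38.2560, 38.2180, 38.2530)
    pred_lon <- c(-85.7420, -85.7590, -85.7570)

    nearest_cluster_2d(pred_lat, pred_lon, centers)
    #> [1] 1 4 5
    ```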