Filtering for a Large Dataset within a 2000-Meter Radius from Grid Points

I have a huge data frame of 13170180 rows, and I have a grid of 1107650 points. I want to filter data points based on a 2000-meter radius from the center of each grid point.

library(geosphere)
library(dplyr)
library(pbapply)
library(ratser)
library(sf)
library(dplyr)  

# Define the resolution of the grid (in degrees)
grid_resolution <- 1 / 60  # 1 arc-minute resolution

# Create an empty raster grid with the specified extent and resolution
grid <- raster(xmn = xmin, xmx = xmax, ymn = ymin, ymx = ymax, 
               resolution = grid_resolution, crs = "+proj=longlat +datum=WGS84")

# Extract grid cell centers (midpoints)
grid_centers <- coordinates(grid)

# Convert to a data frame for easy manipulation
grid_points<- data.frame(grid_centers)
colnames(grid_points) <- c("long_center", "lat_center")

# Define the radius for filtering (2000 meters)
radius <- 2000

# Initialize an empty list to store the points within 2000m for each grid center
points_within_2000m_list <- vector("list", nrow(grid_points))

# Use pblapply to loop through each grid center with a progress bar
points_within_2000m_list <- pblapply(1:nrow(grid_points), function(i) {
  # Extract the grid center coordinates (latitude and longitude)
  target_lat <- grid_points[i, 2]
  target_long <- grid_points[i, 1]
  
  # Calculate the distance between each data point and the current grid center
  data$distance <- distHaversine(
    matrix(c(data$long, data$lat), ncol = 2),
    matrix(c(target_long, target_lat), ncol = 2)
  )
  
  # Filter data points within 2000 meters of the current grid center
  points_within_2000m <- data %>%
    filter(distance <= radius)
  
  # Return the filtered points
  return(points_within_2000m)
})

# Now, `points_within_2000m_list` contains a list of data frames, 
# where each data frame holds the points within 2000 meters of each grid center

Running this code will take 28 days to complete. Is there any way of running it faster?

Solution

Not a complete answer, but I would like to suggest sf::st_intersects() function to find a list of points within buffers.

Creating the raster as starting point for grid.

grid_resolution <- 1 / 60
grid <- terra::rast(xmin = 0, xmax = 4, ymin = 0, ymax = 4, 
               resolution = grid_resolution, crs = "+proj=longlat +datum=WGS84")
terra::values(grid) <- 1:terra::ncell(grid)

Now, creating buffers around every raster cell grid:

radius <- 2000

b <- as.data.frame(grid, xy = TRUE)[, 1:2] |>
  terra::vect(geom = c("x", "y"), crs = terra::crs(grid)) |>
  terra::buffer(radius)

Now, a sample points instead of your data frame, but you can create your own in {sf} package using sf::st_as_sf(df, coords = c("X", "Y"), crs = ...).

points <- terra::spatSample(grid, size = 40000,
                method = "random", xy=TRUE)
points$lyr.1 <- row.names(points)
p <- terra::vect(points, geom = c("x", "y"), crs = terra::crs(grid))

terra::plot(grid)
terra::points(points, cex = 0.1, pch = 10)

Now a bit of {sf} package: we are converting buffers and sample points to sf objects and run intersects() to get the list of intersecting objects:

b_sf <- sf::st_as_sf(b)
p_sf <- sf::st_as_sf(p)

tictoc::tic()
l_sf <- sf::st_intersects(b_sf, p_sf)
tictoc::toc()
#> 10.082 sec elapsed

l_sf
#> Sparse geometry binary predicate list of length 57600, where the
#> predicate was `intersects'
#> first 10 elements:
#>  1: (empty)
#>  2: 1338, 18981
#>  3: 1338, 16901, 20448
#>  4: 1338, 15566, 20448
#>  5: 20448
#>  6: 7345, 34196
#>  7: 4603, 7345, 14446
#>  8: 4603, 7345, 22396
#>  9: 4603, 32411
#>  10: 1159, 30164, 32411

10 sec for 50+ k buffers, not bad on lazy laptop. l_sf[1] is empty, l_sf[2] contains points 1338 and 18981 etc. And sample plot for buffer No. 1016 which contains 5 points:

terra::plot(grid, 
            xlim = c(sf::st_bbox(b_sf[1016, ,])$xmin[[1]]-0.02, sf::st_bbox(b_sf[1016, ,])$xmax[[1]]+0.02),
            ylim = c(sf::st_bbox(b_sf[1016, ,])$ymin[[1]]-0.02, sf::st_bbox(b_sf[1016, ,])$ymax[[1]]+0.02))

b_sf |>
  terra::plot(add = TRUE, lwd=0.5, lty = 3)
b_sf[1016, ,] |>
  terra::plot(add = TRUE, lwd = 2)

p_sf |>
  subset(lyr.1 %in% l_sf[1016][[1]]) |>
  terra::plot(add = TRUE, pch = 10, col = "red")

You can search points for each buffer with {terra} as well, but in my tests it takes a bit longer (by magnitude or so)

l <- lapply(seq_len(nrow(b)), function(x) terra::values(terra::intersect(p, b[x, ]))[, 1])

For further time reduction I would suggest to divide the grid and data points to smaller (overlapping by at least 2000 m) chunks and try {future} package or just few more R sessions at the same time.

^{Created on 2024-10-14 with reprex v2.1.0}