Search code examples
roptimizationdistance

Calcualte points in range with with different properties


I have the following data frame that contains points that originate from different samples. Each point has a type. I need to calculate, for each point belonging to a given sample of a given type (for instance for "Sample_1" type "A") how many points of another type are around it in a given cutoff. My current implementation uses "future.apply" and I was wondering if there is a more efficient way to solve this problem. The example here is limited and should run quickly, the real problem is composed of several thousands of lines and it's much slower. In the end I store the results in a list. This list has, for each element with "type" in "starting_point", the number of elements of type "target_point" in a threshold of 40.

library(future)
library(future.apply)

a_test=data.frame(ID=sample(c("Sample_1", "Sample_2", "Sample_3"), 100, replace=TRUE), type=sample(c("A", "B", "C", "D"), 100, replace = TRUE), xpos=sample(1:200, 100, replace=TRUE), ypos=sample(1:200, 100, replace=TRUE))

starting_point=c("A", "B")
target_point=c("C", "D")
threshold=40
result_per_pair=list()

for(sp in starting_point){
  ## Here I select a data frame of "Starting points" without looking
  ## from which ID they came from
  sp_tdf=a_test[a_test$type==sp, ]
  for(tp in target_point){
    ## Here I select a data frame of "Target points" without looking
    ## from which ID they came from
    tp_tdf=a_test[a_test$type==tp, ]
    ## I use future_sapply here, parallelizing on each line of "sp_tdf"
    plan(multisession)
    elements_around=future_sapply(1:nrow(sp_tdf), function(x, sp_tdf, tp_tdf, treshold2){
      xc=sp_tdf$xpos[x]
      yc=sp_tdf$ypos[x]
      ###  NOTE HERE:  At this point I select the points that are in the same
      ###  ID as the current line of sp_tdf
      tp_tdf2=tp_tdf[tp_tdf$ID == sp_tdf$ID[x],]
      ares=tp_tdf2[ (tp_tdf2$xpos-xc)^2 + (tp_tdf2$ypos-yc)^2 <threshold2, ]
      return(nrow(ares))
    },sp_tdf=sp_tdf, tp_tdf=tp_tdf, threshold2=threshold*threshold)
    
    a_newcol=paste0(tp, "_around_", sp)
    ## we need to create a copy of sp_tdf otherwise we add columns to the
    ## initial sp_tdf and we memorize them in the wrong place in the list
    sp_tdf_temp=sp_tdf
    sp_tdf_temp[,  a_newcol]=elements_around
    
    result_per_pair[[ paste0(tp, "_around_", sp ) ]]=rbind(result_per_pair[[ paste0(tp, "_around_", sp ) ]], sp_tdf_temp)
  }
}

You can see the type of table I get here:

head(result_per_pair[[1]])

$C_around_A
          ID type xpos ypos C_around_A
1   Sample_2    A   26   74          1
2   Sample_3    A   64    8          1
3   Sample_3    A  121    2          1
5   Sample_2    A   62   94          0

Solution

  • You can try using RANN::nn2 function:

    id_list <- split(a_test, a_test$ID) 
    
    res <- id_list %>%
      map(~select(.x, xpos, ypos)) %>%
      map(~RANN::nn2(.x, .x, k = nrow(.), searchtype = "radius", radius = threshold)) %>%
      map(1) %>%
      map2(
        id_list, 
        function(x, y){ 
          seq_len(nrow(x)) %>% 
            map(~x[.x,] %>% .[. > 0]) %>% 
            map(~y[.x,]) %>% 
            map("type") %>%
            map_dfr(table) %>%
            mutate(across(everything(), as.integer))
        }
      ) %>%
      map2_dfr(id_list, ~bind_cols(.y, .x))
    

    Some time improvements might be done replacing tidyverse functions (hard to say how fast it is on your example). Result:

    res %>% head()
    
    ID type xpos ypos A B C D
    Sample_1    C   48  157 0 0 3 1
    Sample_1    D  177   97 1 1 1 3
    Sample_1    C   10   71 0 0 3 0
    Sample_1    C   71  168 1 1 2 0
    Sample_1    D   82   48 1 0 1 2
    Sample_1    C  165   71 3 3 1 1
    

    where columns A-D represent count of types within same ID. I was using seed 123 for generating a_test. You can adapt algorithm to work with starting_point and target_point with spliting each id_list into two parts - those defined by starting_point and target_point and adapting data & query arguments in RANN::nn2.

    edit

    Function based on the ideas of upper comment:

    f <- function(df, threshold, start = levels(df$type), target = levels(df$type)){
      
      my_lists <- df %>%
        filter(type %in% c(start, target)) %>%
        split(.$ID) %>%
        map(
          function(x){ 
            map(
              list(start, target), 
              ~filter(x, type %in% .x) %>% mutate(type = droplevels(type))
            )
          }
        ) %>%
        discard(~any(map_int(.x, nrow) == 0))  
        
      indices <- my_lists %>%
        map(
          ~RANN::nn2(
            data = select(.x[[2]], xpos, ypos), 
            query = select(.x[[1]], xpos, ypos), 
            k = nrow(.x[[2]]), 
            searchtype = "radius", 
            radius = threshold
          )
        ) %>%
        map(1) %>%
        map(function(x) seq_len(nrow(x)) %>% map(~x[.x,] %>% .[. > 0]))
      
      my_lists %>%
        map(2) %>%
        map2(indices, function(x, y) map_dfr(y, ~summary(x[.x,]$type))) %>%
        {map2_dfr(map(my_lists, 1), ., bind_cols)}
    }
    

    To count C around A with radius 40:

    f(a_test, 40, "A", "C")