Search code examples
rdplyrdata.tablercpp

How to conditionally summarize on other entries in the group - R


In my dataset I have Cartesian coordinates of different items overtime identified by an EventID, event_type, ID number, x position, y position, identity type, broad category, and frame id number. What I need to do is go for each EventID, event_type pair, and frame id number go through each ID number and calculate which other ID number with a different broad category has the minimum distance from the current row. I would like to avoid using for loops for this because the dataset is several million lines long.

I tried formulating this as a group_by and summarize call using dplyr but couldn't quite wrap my head around how I could call a function on the current row x, an y against all other x, and ys and then choose the conditional minimum.

two_dim_euclid = function(x1, x2, y1, y2){
  a <- sqrt((x1 - x2)^2 + (y1 - y2)^2)
  return(a)
}


# Example Data
df <- data.frame(stringsAsFactors = FALSE,
                 EventID = c(1003, 1003, 1003, 1003),
                 event_type = c(893, 893, 893, 893),
                 ID_number = c(80427, 2346, 24954, 27765),
                 x = c(86.07, 72.4, 43.08, 80.13),
                 y = c(35.58, 26.43, 34.8, 34.79),
                 identity_type = c("A", "C", "B", "B"),
                 broad_category = c("set1", "set1", "set2", "set2"),
                 frame_id = c(1, 1, 1, 1))
df
#  EventID event_type ID_number x     y     identity_type broad_category frame_id
#1 1003    893        80427     86.07 35.58 A             set1           1
#2 1003    893        2346      72.40 26.43 C             set1           1
#3 1003    893        24954     43.08 34.80 B             set2           1
#4 1003    893        27765     80.13 34.79 B             set2           1

The expected result would return 5.992303 for row 1 it looks for all the entries not belonging to set1 with the same EventID, event_type, and frame_id and then returns the minimum euclidian distance given those parameters.

Also, I want to do this for every entry with identity type A. But, the identity_type and broad_category are not always tied together. A can belong to either set1 or set2.


Solution

  • While I'm not sure about your criteria, it seems that you MUST use for loops in some way if you want to iterate. I'm sure others can provide you with Rcpp solutions that are very quick. In the meantime, here is one possible way with base R.

    # In the future, please provide the code to create your example data
    dat <- structure(list(EventID = c(1003L, 1003L, 1003L, 1003L), 
                      event_type = c(893L, 893L, 893L, 893L), 
                      ID_number = c(80427L, 2346L, 24954L, 27765L), 
                      x = c(86.07, 72.4, 43.08, 80.13), 
                      y = c(35.58, 26.43, 34.8, 34.79), 
                      identity_type = structure(c(1L, 3L, 2L, 2L), 
                                                .Label = c("A", "B", "C"), 
                                                class = "factor"), 
                      broad_category = structure(c(1L,  1L, 2L, 2L), 
                                                 .Label = c("set1", "set2"), 
                                                 class = "factor"), 
                      frame_id = c(1L,  1L, 1L, 1L)), 
                 .Names = c("EventID", "event_type", "ID_number","x", "y", 
                            "identity_type", "broad_category", "frame_id"), 
                 class = "data.frame", row.names = c("1", "2", "3", "4"))
    
    # Define your criteria here
    dat$uniqueID <- paste0(dat$EventID, dat$event_type, dat$frame_id, dat$broad_category)
    # made your function have two 2 dim vectors instead since that's simpler for passing in
    two_dim_euclid = function(a, b) return(sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2))
    
    n <- nrow(dat)
    vec <- numeric(n)
    for(i in 1:n){
      vec[i] = sum(apply(dat[dat$uniqueID != dat$uniqueID[i], c("x","y")], 1, 
                         function(r) two_dim_euclid(dat[i,c("x","y")], r)), na.rm = T)
      if(i%%10000 == 0) cat(i,"completed...\n") # Progress check since >1mil rows
    }
    dat$result <- vec