Search code examples
rdata.table

How to make functions run faster?


I have written the function find_one, whose function is to find out the value of the variable that appears repeatedly, but it runs slowly when the number of rows in the data frame is greater than 10,000. I want to modify this function to reduce the running time. Is there any way to achieve this effect? I am a beginner of R, and this function was provided by AI. It is difficult for me to figure out the problem with it. Thank you very much for your help and correction!

find_one <- function(x, y, z, df){
  library(data.table)
  start_time <- Sys.time() 
  
  data <- data.table(
    id = df[[x]],
    name = df[[y]],
    n = df[[z]]
  )
  
  # Step 1: Create bidirectional relationships with "n"
  relationships <- unique(rbind(
    data[, .(key = id, value = name, n = n)], # id -> name
    data[, .(key = name, value = id, n = n)] # name -> id
  ))
  
  # Step 2: Build clusters of relationships
  graph <- relationships[, .(key, value, n)]
  clusters <- list()
  while (nrow(graph) > 0) {
    # Initialize a cluster
    cluster <- graph[1]
    graph <- graph[-1]
    
    # Expand the cluster
    while (TRUE) {
      new_links <- graph[key %in% cluster$value | value %in% cluster$key]
      if (nrow(new_links) == 0) break
      
      cluster <- rbind(cluster, new_links)
      graph <- graph[!(key %in% cluster$key & value %in% cluster$value)]
    }
    clusters <- append(clusters, list(cluster))
  }
  
  # Step 3: Extract valid records (optimized)
  valid_records <- lapply(clusters, function(cluster) {
    # Identify rows corresponding to `id` and `name` in one go
    is_id <- cluster$key %in% data$id
    is_name <- cluster$value %in% data$name
    
    ids <- unique(cluster$key[is_id])
    names <- unique(cluster$value[is_name])
    ns <- unique(cluster$n)  # Row indices
    
    list(id = ids, name = names, n = ns)
  })
  
  # Step 4: Convert to the desired structure
  result <- tibble(
    id = lapply(valid_records, function(x) as.character(x$id)),
    name = lapply(valid_records, function(x) as.character(x$name)),
    n = lapply(valid_records, function(x) as.integer(x$n))
  )
  
  end_time <- Sys.time() 
  (end_time - start_time) %>% print() 
  
  return(result)
}
data = data.frame(
  v1 = c("a", "a", "a", "b", "c", "c", "d", "e"),
  v2 = c("123", "123", "124", "124", "125", "126", "127", "128"),
  v3 = 1:8
)

desired_result <- find_one(x = "v1", y = "v2", z = "v3", df = data)

Desired outcome:

desired_result  <- structure(list(id = list(c("a", "b"), "c", "d", "e"), name = list( c("123", "124"), c("125", "126"), "127", "128"), n = list( 1:4, 5:6, 7L, 8L)), class = c("tbl_df", "tbl", "data.frame" ), row.names = c(NA, -4L))

Solution

  • You could solve this using igraph:

    a <- igraph::components(igraph::graph_from_data_frame(data))$membership
    b <- aggregate(data, list(a[as.character(data[[1]])]), \(x)unique(x))[-1]
    names(b) <- c('id', 'name', 'n')
    
        id     name          n
    1 a, b 123, 124 1, 2, 3, 4
    2    c 125, 126       5, 6
    3    d      127          7
    4    e      128          8
    

    Note that aggregate is supper slow. I would rather just leave the data with the grouping factor than nesting it.


    You could check equivalency:

    all.equal(desired_result, structure(b, class = class(desired_result)))
    [1] TRUE
    

    If you want tidyverse:

    a <- igraph::components(igraph::graph_from_data_frame(data))$membership
    
    data %>%
       group_by(a = a[as.character(v1)]) %>%
       summarise(across(everything(), ~list(unique(.x))))
    
    # A tibble: 4 × 4
          a v1        v2        v3       
      <dbl> <list>    <list>    <list>   
    1     1 <chr [2]> <chr [2]> <int [4]>
    2     2 <chr [1]> <chr [2]> <int [2]>
    3     3 <chr [1]> <chr [1]> <int [1]>
    4     4 <chr [1]> <chr [1]> <int [1]>