I have written the function find_one
, whose function is to find out the value of the variable that appears repeatedly, but it runs slowly when the number of rows in the data frame is greater than 10,000. I want to modify this function to reduce the running time. Is there any way to achieve this effect?
I am a beginner of R, and this function was provided by AI. It is difficult for me to figure out the problem with it.
Thank you very much for your help and correction!
find_one <- function(x, y, z, df){
library(data.table)
start_time <- Sys.time()
data <- data.table(
id = df[[x]],
name = df[[y]],
n = df[[z]]
)
# Step 1: Create bidirectional relationships with "n"
relationships <- unique(rbind(
data[, .(key = id, value = name, n = n)], # id -> name
data[, .(key = name, value = id, n = n)] # name -> id
))
# Step 2: Build clusters of relationships
graph <- relationships[, .(key, value, n)]
clusters <- list()
while (nrow(graph) > 0) {
# Initialize a cluster
cluster <- graph[1]
graph <- graph[-1]
# Expand the cluster
while (TRUE) {
new_links <- graph[key %in% cluster$value | value %in% cluster$key]
if (nrow(new_links) == 0) break
cluster <- rbind(cluster, new_links)
graph <- graph[!(key %in% cluster$key & value %in% cluster$value)]
}
clusters <- append(clusters, list(cluster))
}
# Step 3: Extract valid records (optimized)
valid_records <- lapply(clusters, function(cluster) {
# Identify rows corresponding to `id` and `name` in one go
is_id <- cluster$key %in% data$id
is_name <- cluster$value %in% data$name
ids <- unique(cluster$key[is_id])
names <- unique(cluster$value[is_name])
ns <- unique(cluster$n) # Row indices
list(id = ids, name = names, n = ns)
})
# Step 4: Convert to the desired structure
result <- tibble(
id = lapply(valid_records, function(x) as.character(x$id)),
name = lapply(valid_records, function(x) as.character(x$name)),
n = lapply(valid_records, function(x) as.integer(x$n))
)
end_time <- Sys.time()
(end_time - start_time) %>% print()
return(result)
}
data = data.frame(
v1 = c("a", "a", "a", "b", "c", "c", "d", "e"),
v2 = c("123", "123", "124", "124", "125", "126", "127", "128"),
v3 = 1:8
)
desired_result <- find_one(x = "v1", y = "v2", z = "v3", df = data)
Desired outcome:
desired_result <- structure(list(id = list(c("a", "b"), "c", "d", "e"), name = list( c("123", "124"), c("125", "126"), "127", "128"), n = list( 1:4, 5:6, 7L, 8L)), class = c("tbl_df", "tbl", "data.frame" ), row.names = c(NA, -4L))
You could solve this using igraph
:
a <- igraph::components(igraph::graph_from_data_frame(data))$membership
b <- aggregate(data, list(a[as.character(data[[1]])]), \(x)unique(x))[-1]
names(b) <- c('id', 'name', 'n')
id name n
1 a, b 123, 124 1, 2, 3, 4
2 c 125, 126 5, 6
3 d 127 7
4 e 128 8
Note that aggregate
is supper slow. I would rather just leave the data with the grouping factor than nesting it.
You could check equivalency:
all.equal(desired_result, structure(b, class = class(desired_result)))
[1] TRUE
If you want tidyverse:
a <- igraph::components(igraph::graph_from_data_frame(data))$membership
data %>%
group_by(a = a[as.character(v1)]) %>%
summarise(across(everything(), ~list(unique(.x))))
# A tibble: 4 × 4
a v1 v2 v3
<dbl> <list> <list> <list>
1 1 <chr [2]> <chr [2]> <int [4]>
2 2 <chr [1]> <chr [2]> <int [2]>
3 3 <chr [1]> <chr [1]> <int [1]>
4 4 <chr [1]> <chr [1]> <int [1]>