Search code examples
rforeachdplyradjacency-matrixdoparallel

Improve nested for loop speed R creating adjacency matrix


I want to create an adjacency matrix. The matrix should show how many identical Value-strings there are per keyword.

My current approach with two for loops takes a while when working with more data. I have looked into foreach package but could not get my head around for this example. I would appreciate your help for any speed gains ;)

#create reproducible dataset
set.seed(11)
x <- rep('keyword', 10)
y <- seq(1, 10)
z <- rep('value', 10)

df <- tibble::tibble(Keyword = rep(paste0(x,y),4), 
                     Values = paste0(sample(z, 40, replace = TRUE), 
                                     sample(y, 40, replace = TRUE)))

#format dataset
temp_df <- df %>% 
  dplyr::group_by(Keyword) %>%
  dplyr::summarise(Values = toString(Values))  %>%
  dplyr::ungroup() %>%
  dplyr::distinct(Keyword, .keep_all = TRUE)

#initialise adjacency matrix
adj_matrix <- data.frame()

#for loops to iterate through values
for (i in 1:nrow(temp_df)) {

  y <- trimws(unlist(strsplit(temp_df$Values[i], split = ',')))

  for (g in i:nrow(temp_df)) {

    f <- trimws(unlist(strsplit(temp_df$Values[0+g], split = ',')))
    z <- y %in% f
    adj_matrix[i,g] <- sum(z)

  }
}

#name rows and columns
colnames(adj_matrix) <- temp_df$Keyword
rownames(adj_matrix) <- temp_df$Keyword

The adj_matrix is sparse (i.e. just half of it is filled) and you can see which keyword shares how many identical Value-strings. With this matrix, I can easily display relationships in a network diagram.

Thanks in advance!

Jan


Solution

  • It's not clear to me whether your real data starts as df or temp_df. Either way you can avoid the processing in the nested loop by using outer() which should speed things up somewhat.

    library(dplyr)
    library(purrr)
    
    am_outer <- df %>%
      split(f = .$Keyword) %>%
      map(pull, Values) %>%
      outer(., ., function(x,y) sapply(seq_along(x), function(i) sum(x[[i]] %in% y[[i]])))
    
    am_outer[lower.tri(am_outer)] <- NA
    
    identical(data.frame(am_outer), adj_matrix)
    
    [1] TRUE
    

    If the data begins as temp_df you can use:

    temp_df %>% 
      separate_rows(Values, sep = ", ") %>% 
      split(f = .$Keyword) %>%
      map(pull, Values) %>%
      outer(., ., function(x,y) sapply(seq_along(x), function(i) sum(x[[i]] %in% y[[i]])))