Tags: r, data.table, bioconductor, stringdist, fuzzyjoin

How to maximize R fuzzyjoin/stringdist speed and memory efficiency


I have 2 data frames containing short (length == 20) sequences that I want to compare with string-distance techniques, returning highly similar pairs with a Hamming distance of no more than 3 (i.e. at most 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() accomplishes this task well, but it can't handle the quantity of sequences I want to compare (tens to hundreds of thousands in each data frame) unless I chunk the query sequences. When my data frames are on the larger side, the chunked approach in the code below starts taking full days to execute.
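
For reference, fuzzyjoin's string matching builds on the stringdist package, and the Hamming criterion can be checked directly on a pair of sequences. A minimal illustration on two made-up 20-mers:

library(stringdist)

# two 20-mers differing at positions 3 and 11 -> Hamming distance 2
s1 <- "ACGTACGTACGTACGTACGT"
s2 <- "ACTTACGTACATACGTACGT"
stringdist(s1, s2, method = "hamming")
#> [1] 2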

Is there some way I can use the fuzzyjoin or stringdist packages with data.table to speed things up and preserve memory? I keep trying various approaches, but they result in even slower execution.

library(tidyverse)
library(fuzzyjoin)

### simulate data ###

chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq),
                    seq = replicate(nq, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
                      seq = replicate(ns, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))

### return seqs with 3 or fewer mismatches ###

# # NOT ENOUGH MEMORY
# stringdist_join(query, subject,
#                 by = "seq",
#                 method = "hamming",
#                 mode = "left",
#                 max_dist = 3,
#                 distance_col = "mismatches")

# chunk query rows into groups of ~100 to limit memory per join
query <- query %>%
  mutate(grp = (plyr::round_any(row_number(), 100) / 100) + 1)

# get a vector of all group ids
var.grps <- unique(query$grp)

# map over the chunks, row-binding the joins into one data frame
df_out <- purrr::map_df(var.grps, function(i) {
  q <- filter(query, grp == i)
  dat <- stringdist_join(q, subject,
                         by = "seq",
                         max_dist = 3,
                         method = "hamming",
                         mode = "left",
                         ignore_case = TRUE,
                         distance_col = "mismatch")
  return(dat)
})

Solution

  • I figured it out: stringdist_join() uses stringdistmatrix() behind the scenes, so it's much faster to call stringdistmatrix() directly and collect the desired information from its output. To overcome memory issues, I processed the query sequences in chunks, writing each chunk's distances into a preallocated matrix.

    library(stringdist)

    # build the full query x subject Hamming distance matrix in chunks
    chunk_size <- 1000
    num_rows <- nrow(query)
    
    # Preallocate the result matrix; each chunk fills its rows below
    sdm <- matrix(0, nrow = num_rows, ncol = nrow(subject))
    
    # Loop through the rows in chunks
    for (start_row in seq(1, num_rows, by = chunk_size)) {
      end_row <- min(start_row + chunk_size - 1, num_rows)
      
      # Subset the rows for the current chunk
      chunk_query <- query$seq[start_row:end_row]
      
      # Compute stringdist matrix for the current chunk
      chunk_sdm <- stringdistmatrix(chunk_query, subject$seq, method = "hamming")
      
      # Assign the chunk_sdm to the corresponding rows in the main sdm matrix
      sdm[start_row:end_row, ] <- chunk_sdm
    }
    rownames(sdm) <- query$name
    colnames(sdm) <- subject$name
    
    # find the indices where the distance is 3 or fewer
    indices <- which(sdm <= 3, arr.ind = TRUE)
    
    # extract row names, col names, and values based on the indices
    result <- data.frame(query = rownames(sdm)[indices[, 1]],
                         subject = colnames(sdm)[indices[, 2]],
                         mismatch = sdm[indices])
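
    If even the preallocated matrix is too large to hold (it stores a distance for every query/subject pair), a variant of the same idea avoids materializing the full matrix: filter each chunk's distances immediately and keep only the qualifying pairs. This is a sketch using the same variable names, not part of the original answer:

    library(stringdist)

    chunk_size <- 1000
    num_rows <- nrow(query)

    # accumulate only the qualifying pairs from each chunk
    hits <- vector("list", ceiling(num_rows / chunk_size))

    for (i in seq_along(hits)) {
      start_row <- (i - 1) * chunk_size + 1
      end_row <- min(start_row + chunk_size - 1, num_rows)

      # distances for this chunk only (at most chunk_size x nrow(subject))
      chunk_sdm <- stringdistmatrix(query$seq[start_row:end_row],
                                    subject$seq, method = "hamming")

      # keep pairs with 3 or fewer mismatches; drop the rest immediately
      idx <- which(chunk_sdm <= 3, arr.ind = TRUE)
      hits[[i]] <- data.frame(
        query    = query$name[start_row:end_row][idx[, 1]],
        subject  = subject$name[idx[, 2]],
        mismatch = chunk_sdm[idx]
      )
    }

    result <- do.call(rbind, hits)

    Peak memory is then roughly one chunk's matrix plus the accumulated hits, rather than the full num_rows x nrow(subject) matrix.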