Tags: r, performance, position, set-difference

R: find and count all differences by position (of one element added, subtracted or substituted) between character vectors nested in a list


I have a list of character vectors representing words split in phonemes:

> head(words)
[[1]]
[1] "UU"

[[2]]
[1] "EY" "Z" 

[[3]]
[1] "T"  "R"  "IH" "P"  "UU" "L"  "EY"

[[4]]
[1] "AA" "B"  "ER" "G" 

[[5]]
[1] "AA" "K"  "UU" "N" 

[[6]]
[1] "AA" "K"  "ER"

For each word in the list, I would like to find the number of words that differ from the considered word by exactly one phoneme (one phoneme added, subtracted, or substituted), with the remaining phonemes in the same positions. In this sense, for the word "EY" "Z", acceptable cases would be:

[1] "M"  "EY" "Z" 

[1] "AY" "Z"

[1] "EY" "D" 

[1] "EY" "Z" "AH"

But the following cases should be rejected:

[1] "EY" "D"  "Z"

[1] "Z" "EY" "D"

[1] "HH" "EY"

Basically, I would like to find differences of one element while respecting the positions of the phonemes in the vectors. At the moment, the best solution I have found is:

diffs <- c()
for (i in seq_along(words)) {
  diffs <- c(diffs, sum(sapply(words, function(y) {
    count <- 0
    elements <- list(words[[i]], y)
    len <- c(length(words[[i]]), length(y))
    # if dropping the first phoneme of the longer word gives the shorter word,
    # the pair differs only by one phoneme added at the beginning
    if (identical(elements[which(len==max(len))][[1]][-1], elements[which(len==min(len))][[1]]) == 1) {
      count + identical(elements[which(len==max(len))][[1]][-1], elements[which(len==min(len))][[1]])
    } else {
      # otherwise pad the shorter word with "$$" and count position-wise mismatches
      length(elements[which(len==min(len))][[1]]) <- length(elements[which(len==max(len))][[1]])
      elements <- rapply(elements, f=function(x) ifelse(is.na(x),"$$",x), how="replace")
      count + sum(elements[[1]] != elements[[2]])
    }
  }) == 1))
}

However, this solution is taking ages because my list words has 120,000 elements (words/vectors), so I would like to ask if you know of other solutions to speed up the process.

Thank you very much in advance for your answers


Solution

  • A different answer, using regular Levenshtein distance (i.e. allowing insertions at any point), but this time fast: 1000 words in 15 seconds.

    The trick is to use one of the fast Levenshtein implementations available in R packages; here I'm using stringdist, but any should work. The catch is that they operate on strings of single characters, not on multi-character phoneme representations. There is a trivial fix for that: since there are more available characters than distinct phonemes, we can simply translate each phoneme into a single character. The resulting strings are unreadable as phonemic transcriptions, but they work perfectly well as input to the neighborhood density algorithm.

    library(stringdist)
    
    # all distinct phonemes that occur in the lexicon
    phonemes <- unique(unlist(words))
    
    # single-character stand-ins, with a few buffer characters
    targets <- c(letters, LETTERS, 0:9, "!", "§", "%", "&", "/", "=", 
                 "#")[1:length(phonemes)]
    
    # named lookup vector: phoneme -> single character
    ptmap <- targets
    names(ptmap) <- phonemes
    
    # translate every word into a compact single-character string
    wordsT <- sapply(words, function(i) paste0(ptmap[i], collapse=""))
    
    wordlengths <- nchar(wordsT)
    
    # neighbours of x: same-length words at Hamming distance 1 (one substitution)
    # plus words one character longer or shorter at Levenshtein distance 1
    # (one insertion or deletion)
    onediffs.M <- function(x) {
      lengthdiff <- abs(wordlengths - nchar(x))
      sum(stringdist(x, wordsT[lengthdiff == 0], method="hamming") == 1) +
        sum(stringdist(x, wordsT[lengthdiff == 1], method="lv") == 1)
    }
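
    To get the counts for the whole lexicon, the function can then be applied to every translated word. A minimal usage sketch, assuming the objects defined above (the name neighbour_counts is mine, not part of the original answer):
    
    # apply the counter to each translated word; each call still scans the full
    # lexicon, so expect this to take a while for 120,000 words
    neighbour_counts <- sapply(wordsT, onediffs.M)
    head(neighbour_counts)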