r dataframe matrix datatable edit-distance

implementation in R: Finding the distance between two lists of characters

I am new to R and trying to understand how I can implement the algorithm below in R. I have two lists of characters and want to see what is the minimum distance between these two.

List 1: "a", "b", "c"
List 2: "a", "b", "c", "d"

At the first step: I have created a table like this :

  a   b   c
a 0   1   2
b 1
c 2
d 3

At the second step: I filled the rest of the matrix with '0's

  a   b   c
a 0   1   2
b 1   0   0
c 2   0   0
d 3   0   0

Now, I want to start calculating the distance among these two lists by this algorithm and update the matrix:

if (characters_in_header_of_matrix[i]==characters_in_column_of_matrix [j] & value[i,j] == value[i+1][j-1] )
then {get the 'diagonal value' #diagonal value= value[i, j-1]}

else{
value[i,j] = min(value[i-1, j], value[i-1, j-1],  value[i, j-1]) + 1
}
endif

for finding the difference between two liststhat you can see in the header and the column of the matrix, I have used the strcmp() function. But, I fail at implementing this. The final result should look like this :

  a   b   c
a 0   1   2
b 1   0   1
c 2   1   0
d 3   2   1

I would appreciate your help.Thanks

Solution

In R, there are a few things that you can do brute-force with for loops and conditionals, but they might easily be done in a vectorized method. The benefit might be speed (though might not) but can often be appreciated in the simpler code and (once you can grok the functions) readability and maintainability.

Take for example this problem:

l1 <- c("a", "b", "c")
l2 <- c("a", "b", "c", "d")

You want to find the "distance" (absolute distance, to be specific) between each letter in l1 from each letter in l2. The outer function does an "outer product" of the two vectors. For instance, if we were to constructively (not actually) do outer(a:c, 1:3), it would pair a1, a2, a3, b1, ..., c3. (That is not legal R code, just used for demonstration, though it can be done quite easily with a couple of minor additions.)

In our case, if we do outer(l1, l2), the function it uses defaults to multiplication (*), since its initial use is often in linear algebra, but this function can easily be overridden with FUN=. Internally, what it is doing is creating two (much longer) vectors doing all of the pairing. We can see what is happening under the hood if we introduce a debugging function to inspect the state.

debugfunc <- function(a, b) { browser(); 1; }

(The 1 is there solely as a place-holder.)

outer(l1, l2, FUN=debugfunc)
# Called from: FUN(X, Y, ...)
# Browse[2]> 
a # <--- the object 'a' here is the first argument to this function
#  [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"
# Browse[2]> 
b # <--- the second argument
#  [1] "a" "a" "a" "b" "b" "b" "c" "c" "c" "d" "d" "d"

In order, this pairs "a" with "a", then "b" with "a", then "c" with "a", etc. It exhausts the first (l1) vector and then increments the second vector, repeating until both are exhausted. At this point, the debugfunc is called exactly once with these two vectors (not once per pair, as some might suspect), so your FUN= function must be able to do all operations in one call.

One might want to look at the distances here. You can determine an individual letter's position in the alphabet by using match("a", letters) (the companion LETTERS is all-caps). In general, match finds the position of the first argument(s) in the second. So continuing within debugfunc:

# Browse[2]> 
match(a, letters)
#  [1] 1 2 3 1 2 3 1 2 3 1 2 3
# Browse[2]> 
match(b, letters)
#  [1] 1 1 1 2 2 2 3 3 3 4 4 4

So what you want is really the difference between those two numeric vectors. One could easily do:

# Browse[2]> 
match(a, letters) - match(b, letters)
#  [1]  0  1  2 -1  0  1 -2 -1  0 -3 -2 -1

but since we need the absolute distance, we really need

# Browse[2]> 
abs( match(a, letters) - match(b, letters) )
#  [1] 0 1 2 1 0 1 2 1 0 3 2 1

Ok, so I think we have our function here. Let's break out of the debugger (Q) and try this a little more formally:

distfunc <- function(a, b) abs( match(a, letters) - match(b, letters) )
outer(l1, l2, FUN=distfunc)
#      [,1] [,2] [,3] [,4]
# [1,]    0    1    2    3
# [2,]    1    0    1    2
# [3,]    2    1    0    1

Note that the first argument becomes the rows, so l1 with a length of 3 gives us 3 rows. If you need to apply the row/column names, then:

o <- outer(l1, l2, FUN=distfunc)
dimnames(o) <- list(l1, l2)
o
#   a b c d
# a 0 1 2 3
# b 1 0 1 2
# c 2 1 0 1

(Changing the order of the arguments will give you precisely the matrix you are seeking.)