
Conditional for loop in R not recognizing conditional statement?


Assume I have a data structure like the following, where doc_id is the document identifier, text_id is the unique text/version identifier, and text is a character string:

df <- cbind(doc_id  = as.numeric(c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6)),
            text_id = as.numeric(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
            text    = as.character(c("string1", "str2ing", "3string",
                                     "string6", "s7ring", "string8",
                                     "string9", "string10",
                                     # the last two values are what cbind() recycles to fill 10 rows
                                     "string1", "str2ing")))

What I am trying to do in the loop is compute string edit distances, but only between different versions of the same document. In short, I want to find matching doc_ids and compare, pairwise, only the different versions (text_ids) of each document.

#Results matrix
result <- matrix(ncol=10, nrow=10)

#Loop
i=1
for (j in 1:length(df[,2])) {
  for (i in 1:length(df[,2])) {
#Conditional Statements
    if(df[i,1]==df[j,1]){
      result[i,j]<-levenshteinDist(df[j,3], df[i,3])}
    else(result[i,j]<-"Not Compared")
  }
  print(result[i,j])
  flush.console()
}

Returns:

[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "0"

The levenshteinDist() function can be found in the RecordLinkage package, but a similar function, adist(), is also bundled in the utils package.
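
As a rough illustration of the base-R route (a sketch, not part of the loop above), adist() can be called on two strings or on a whole character vector. The names a, b and d are only placeholders for this example:

```r
# adist() ships with base R (utils), so no extra package is needed.
a <- "string1"
b <- "str2ing"

# adist() returns a matrix of distances; for two single strings it is 1 x 1.
d <- adist(a, b)
d[1, 1]

# Passing a single character vector gives all pairwise distances at once.
adist(c("string1", "str2ing", "3string"))
```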

My question is: why is my first conditional statement (if) being ignored, and only the else portion being returned?

Any further advice on coding or processing time efficiency gains will be greatly appreciated.


Solution

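  • You're not outputting correctly. The if branch does run; the problem is that print() sits outside the inner loop, so it fires once per j, after i has already reached 10, and only ever shows result[10, j]. Row 10 (doc_id 6) matches nothing but itself, hence nine "Not Compared" followed by a single "0". Run this version to see the comparisons as they happen, and comment out the message() once you are satisfied that everything is working correctly.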

    library(RecordLinkage)
    
    df <- structure(c("1", "1", "2", "2", "3", "4", "4", "4", "5", "6", 
    "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "string1", 
    "str2ing", "3string", "string6", "s7ring", "string8", "string9", 
    "string10", "string1", "str2ing"), .Dim = c(10L, 3L), .Dimnames = list(
        NULL, c("doc_id", "text_id", "text")))
    
    # Size the result from the data rather than hard-coding 10.
    result <- matrix(ncol = nrow(df), nrow = nrow(df))
    # nrow() and ncol() are more elegant ways of getting row/column counts.
    for(j in seq_len(nrow(df))) {
        for(i in seq_len(nrow(df))) {
            message(sprintf("comparing i=%s (%s), j=%s (%s)", i, df[i, 1], j, df[j, 1]))
            if(identical(df[i, 1], df[j, 1])) {
                result[i, j] <- levenshteinDist(df[j, 3], df[i, 3])
            } else {
                result[i, j] <- "Not Compared"
            }
            # printing inside the inner for loop
            print(result[i, j])
        }
    
    }
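
On the efficiency question, one possible alternative is to drop the loops altogether. The sketch below is an assumption-laden variant, not the method used above: it swaps levenshteinDist() for base R's adist(), keeps the data in a data.frame instead of a cbind() matrix (so doc_id stays numeric), and stores NA rather than the string "Not Compared" so that the result matrix stays numeric. The name same_doc is introduced here only for illustration.

```r
# Same values as the matrix above, but as a data frame so that
# doc_id stays numeric and text stays character.
df <- data.frame(
  doc_id  = c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6),
  text_id = 1:10,
  text    = c("string1", "str2ing", "3string", "string6", "s7ring",
              "string8", "string9", "string10", "string1", "str2ing"),
  stringsAsFactors = FALSE
)

# adist() computes the full pairwise edit-distance matrix in one call.
result <- adist(df$text)

# Logical matrix: TRUE where row i and column j share a doc_id.
same_doc <- outer(df$doc_id, df$doc_id, "==")

# Blank out every pair that belongs to different documents.
result[!same_doc] <- NA

# Label rows and columns with text_id for easier reading.
dimnames(result) <- list(df$text_id, df$text_id)
result
```

Note that this still computes every pair before masking, which is fine for ten rows but wasteful for a large table. If that matters, something like lapply(split(df$text, df$doc_id), adist) would restrict the work to pairs within each document, at the cost of returning a list of small matrices instead of one big one.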