string, r, similarity, sequence-alignment

How do I group similar strings in R?


I have a database with ~5,000 locality names, most of which are repetitions with typos, permutations, abbreviations, etc. I would like to group them by similarity to speed up further processing. Ideally, I would convert each variation into a "platonic form" and put two columns side by side, with the original and platonic forms. I've read about multiple sequence alignment, but it seems to be used mostly in bioinformatics, for DNA, RNA, and peptide sequences, and I'm not sure it will work well with place names. Does anyone know of a library that would help me do this in R? Or which of the many algorithm variations might be easiest to adapt?

EDIT: How do I do that in R? So far I'm using the adist() function, which gives me a matrix of distances between each pair of strings (although it doesn't treat translocations the way I think it should; see comment below). The next step, which I'm working on right now, is to turn this matrix into a grouping/clustering of similar-enough values. Thanks in advance!

EDIT: To solve the translocation problem, I wrote a small function that splits each string into words, strips any remaining punctuation, keeps only the words with more than 2 characters, sorts them, and pastes them back together into a single string.

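sep <- function(linha) {
    # Split on spaces, slashes, and hyphens
    resp <- unlist(strsplit(linha, " |/|-"))
    # Strip commas, semicolons, and periods
    resp <- gsub(",|;|\\.", "", resp)
    # Keep words longer than 2 characters, sorted alphabetically
    resp <- sort(resp[nchar(resp) > 2])
    paste0(resp, collapse = " ")
}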

Then I apply this over all lines of my table

locs[,9] <- apply(locs,1,function(x) sep(x[1])) # 1=original data; 9=new data

and finally apply adist() to create the distance table.
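As a rough illustration of the steps above, here is how they might look on a handful of made-up locality names. The sample data frame and its column names (original, normalized) are assumptions for this sketch, not the real table, and vapply() stands in for apply() only because the toy table has a single column.

```r
# Hypothetical sample data; the real table (locs) is not shown in the question.
locs <- data.frame(
    original = c("Rio Branco, AC", "AC - Rio Branco", "Rio Branko AC",
                 "Sao Paulo/SP", "SP, Sao Paulo"),
    stringsAsFactors = FALSE
)

# Same normalisation as sep() above: split, strip punctuation,
# drop words of 2 characters or fewer, sort, and paste back together.
sep <- function(linha) {
    resp <- unlist(strsplit(linha, " |/|-"))
    resp <- gsub(",|;|\\.", "", resp)
    resp <- sort(resp[nchar(resp) > 2])
    paste0(resp, collapse = " ")
}

locs$normalized <- vapply(locs$original, sep, character(1), USE.NAMES = FALSE)

# Pairwise edit distances between the normalized names
d <- adist(locs$normalized)
print(locs)
print(d)
```

In this toy data the reordered variants collapse to distance 0, while the "Branko" typo stays at distance 1 from its correctly spelled neighbours. Dropping words of two characters or fewer also discards short tokens such as state abbreviations, which may or may not be what you want.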


Solution

  • There's a built-in function called adist() (in the utils package) that computes the generalized Levenshtein (edit) distance between strings.

    It's like agrep(), except that it returns the distance rather than whether the words match under some approximate-matching criterion.

    For the special case of words that can be swapped around a comma (e.g. "hello,world" should be close to "world,hello"), here's a quick hack. You can modify the function fairly easily if you have other special cases.

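    adist_special <- function(word1, word2) {
        # Also try word2 with the two comma-separated halves swapped,
        # and keep whichever distance is smaller.
        swapped <- gsub("(.*),(.*)", "\\2,\\1", word2)
        min(adist(word1, word2), adist(word1, swapped))
    }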
    
    adist("hello,world", "world,hello")
    # 8

    adist_special("hello,world", "world,hello")
    # 0
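To go from the distance matrix to actual groups, one possible approach is hierarchical clustering with hclust() and cutree() from the stats package. The sketch below is only an illustration under assumptions: the sample names are made up, the cut height of 3 and the "average" linkage are arbitrary choices that would need tuning on real data, and picking the group medoid (the member closest to all the others) as the "platonic form" is just one heuristic.

```r
# Made-up, already-normalized locality names (assumption for this sketch)
x <- c("Branco Rio", "Branko Rio", "Branco Rioo",
       "Paulo Sao", "Paullo Sao", "Paulo Sau")

# Pairwise edit distances, then average-linkage hierarchical clustering
d  <- adist(x)
hc <- hclust(as.dist(d), method = "average")

# Cut the tree at an arbitrary height; names whose clusters merge
# below this height end up in the same group.
groups <- cutree(hc, h = 3)

# Medoid of a group: the member with the smallest total distance
# to the other members of the same group.
medoid <- function(idx) idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]
rep_idx <- ave(seq_along(x), groups, FUN = medoid)

result <- data.frame(original  = x,
                     group     = groups,
                     canonical = x[rep_idx],
                     stringsAsFactors = FALSE)
print(result)
```

With these six names the two families end up in separate groups, and each group's canonical form is the spelling that sits closest to its variants. For ~5,000 names the full distance matrix is still small enough to hold in memory, but the cut height is worth checking by eye on a sample of the resulting groups.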