r text matrix similarity

Recognize/differentiate two sentences in R


Here is an example of my data:

Table 1: User table
id     address
1      mont carlo road,CA
2      mont road,IS
3      mont carlo road1-11,CA

Table 2 (the output I want to get): similarity matrix
id   1    2    3
1
2    3
3    1    3

Scale: 1 = very similar, 3 = very dissimilar

My problem is how to measure, in R, the similarity between cases based on the address in Table 1, and then output the result as a similarity matrix like Table 2. The key points are how to compare two sentences in R, how to set a scale that measures the similarity of each pair, and finally how to output a matrix.


Solution

  • I'd also use the stringdist package but would make use of outer and cut to finish the job:

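    library(stringdist)
    dat <- data.frame(
        address = c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA"),
        id = 1:3,
        stringsAsFactors = FALSE  # keep addresses as character (the default only since R 4.0)
    )
    
    # Pairwise string distance between every pair of addresses
    m <- outer(dat[["address"]], dat[["address"]], stringdist, method="jw")
    
    # Bin the off-diagonal distances into 3 equal-width classes: 1 = closest, 3 = farthest
    m[lower.tri(m)] <- cut(m[lower.tri(m)], 3, labels=1:3)
    m[upper.tri(m)] <- cut(m[upper.tri(m)], 3, labels=1:3)
    dimnames(m) <- list(dat[["id"]], dat[["id"]])
    diag(m) <- NA
    m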
    
    ##    1  2  3
    ## 1 NA  3  1
    ## 2  3 NA  3
    ## 3  1  3 NA
    

    You can use any distance method that stringdist supports (see ?stringdist).
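    One thing to be aware of: cut(x, 3) splits the range of the observed distances into three equal-width bins, so the 1 to 3 scale depends on which addresses happen to be in the table. If a fixed scale is preferred, a possible variant is sketched below. It assumes the stringdistmatrix function from the same package and uses cut points of 1/3 and 2/3, which are placeholders I picked for illustration, not values from the question; they would need tuning on real data. Note also that method="jw" with the default p = 0 gives the plain Jaro distance, which lies between 0 and 1.

```r
library(stringdist)

addresses <- c("mont carlo road,CA", "mont road,IS", "mont carlo road1-11,CA")
ids <- 1:3

# Full pairwise distance matrix; "jw" with the default p = 0 is the Jaro distance in [0, 1]
d <- stringdistmatrix(addresses, addresses, method = "jw")

# ASSUMPTION: fixed, hand-picked thresholds (1/3 and 2/3) instead of data-dependent bins.
# labels = FALSE makes cut() return integer codes 1, 2, 3 directly.
score <- matrix(
    cut(d, breaks = c(0, 1/3, 2/3, 1), labels = FALSE, include.lowest = TRUE),
    nrow = nrow(d),
    dimnames = list(ids, ids)
)

# Keep only the lower triangle, as in Table 2
score[upper.tri(score, diag = TRUE)] <- NA
score
```

    With fixed thresholds, a score of 1 means the same thing no matter how many rows are compared, but several pairs may land in the same class if the thresholds are too coarse for the data.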