Tags: r, levenshtein-distance, cosine-similarity, quanteda

Higher weightage to Prefix


Is there a way, or a distance method, to assign a higher weightage to the prefix while calculating similarity? I am aware of the Jaro-Winkler method, but its application is limited to characters. I am looking for similarity at the word level.
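To illustrate what I mean by "limited to characters", here is a small sketch using the stringdist package (its "jw" method takes a Winkler prefix factor p); this is just for illustration, not part of my actual code:

library(stringdist)
# Jaro-Winkler rewards matching *leading characters* (up to 4), not leading words
stringdist("X-ray right leg arteries", "X-ray left leg arteries",
           method = "jw", p = 0.1)   # shared "X-ra" prefix earns the Winkler bonus
stringdist("X-ray right leg arteries", "Rgraphy right leg arteries",
           method = "jw", p = 0.1)   # no bonus: the first characters differ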

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "Rgraphy left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "Rgraphy right shoulder",
  "X-ray left shoulder",
  "Rgraphy right leg arteries"
), stringsAsFactors = FALSE)

library(quanteda)
corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")

docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

dtm3 <- rbind(dfm(corp1, ngrams = 1:2), dfm(corp2, ngrams = 1:2))
d2 <- textstat_simil(dtm3, method = "cosine", diag = TRUE)
as.matrix(d2)[docnames(corp1), docnames(corp2)]

I want "X-ray right leg arteries" from dataframeA should be mapped to "X-ray left leg arteries" from dataframeB instead of "Rgraphy right leg arteries". By that, I meant similarity score between "X-ray right leg arteries" and "X-ray left leg arteries" should be higher as compared to similarity between "X-ray right leg arteries" and "Rgraphy right leg arteries".

Similarly, I want "Rgraphy left shoulder" to be mapped to "Rgraphy right shoulder" instead of "X-ray left shoulder". The above is just a sample; in reality I have a long list, and it is not limited to "X-ray" and "Rgraphy". Hence I don't want to filter on "X-ray" and "Rgraphy" first and then calculate similarity; it should be algorithm-based.


Solution

  • It sounds like you would like to preserve certain diagnostic procedures as features, regardless of the exact wording used, so that these can form the basis for computing similarity between documents.

    You can do this by defining phrases in a dictionary, and applying that before you construct the dfm. Here, I have expanded your texts a bit to include additional features.

    A <- data.frame(text = c("Patient had X-ray right leg arteries.",
                             "Subject was administered Rgraphy left shoulder",
                             "Exam consisted of x-ray leg arteries",
                             "Patient administered x-ray leg with 20km distance."),
                    row.names = paste0("A", 1:4), stringsAsFactors = FALSE)
    B <- data.frame(text = c("Patient had X-ray left leg arteries",
                             "Rgraphy right shoulder given to patient",
                             "X-ray left shoulder revealed nothing sinister",
                             "Rgraphy right leg arteries tested"),
                    row.names = paste0("B", 1:4), stringsAsFactors = FALSE)
    

    Now we can define a dictionary whose values match the phrases that you would like to treat as equivalent for the purposes of computing similarity. In this example, it does not matter whether an X-ray is for the right or left leg, or does not specify this. Similarly, we are not concerned with whether the "Rgraphy" procedure is specific to the left or right shoulder. (Obviously, you will need to adjust and refine these according to exactly what is in your text, and what you are willing to consider as equivalent.)

    medicaldict <- dictionary(list(
        xray_leg = c("X-ray right leg arteries", "x-ray left leg arteries", 
                     "x-ray leg arteries"),
        rgraphy_leg = c("Rgraphy right leg arteries", "Rgraphy left leg arteries"),
        xray_shoulder = c("X-ray left shoulder", "X-ray right shoulder"),
        rgraphy_shoulder = c("Rgraphy left shoulder", "Rgraphy right shoulder")
    ))
    

    When we apply this to the tokens using tokens_lookup() in a "non-exclusive" way, the sequences are substituted by the dictionary keys. Note that because tokens_lookup() collapses the relevant token sequences as phrases, there is no longer a need to form token ngrams as in your question.

    toks <- tokens(corpus(A) + corpus(B)) %>%
        tokens_lookup(dictionary = medicaldict, exclusive = FALSE)
    toks
    # tokens from 8 documents.
    # A1 :
    # [1] "Patient"  "had"      "XRAY_LEG" "."       
    # 
    # A2 :
    # [1] "Subject"          "was"              "administered"     "RGRAPHY_SHOULDER"
    # 
    # A3 :
    # [1] "Exam"      "consisted" "of"        "XRAY_LEG" 
    # 
    # A4 :
    # [1] "Patient"      "administered" "x-ray"        "leg"          "with"         "20km"         "distance"     "."           
    # 
    # A11 :
    # [1] "Patient"  "had"      "XRAY_LEG"
    # 
    # A21 :
    # [1] "RGRAPHY_SHOULDER" "given"            "to"               "patient"         
    # 
    # A31 :
    # [1] "XRAY_SHOULDER" "revealed"      "nothing"       "sinister"     
    # 
    # A41 :
    # [1] "RGRAPHY_LEG" "tested"     
    

    Finally, we can compute document similarity based on the collapsed features, rather than the original bag of words.

    dfm(toks) %>%
        textstat_simil(method = "cosine", diag = TRUE)
    #            A1        A2        A3        A4        B1        B2        B3
    # A2  0.0000000                                                            
    # A3  0.2500000 0.0000000                                                  
    # A4  0.3535534 0.1767767 0.0000000                                        
    # B1  0.8660254 0.0000000 0.2886751 0.2041241                              
    # B2  0.2500000 0.2500000 0.0000000 0.1767767 0.2886751                    
    # B3  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000          
    # B4  0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
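
    If the goal is then to map each document in A to its closest match in B, here is a minimal follow-up sketch, reusing toks from above (the names simmat, a_docs, and b_docs are mine, introduced for illustration):

    # Sketch: for each A document, pick the B document with the highest
    # cosine similarity on the collapsed features.
    simmat <- as.matrix(textstat_simil(dfm(toks), method = "cosine"))
    a_docs <- paste0("A", 1:4)
    b_docs <- paste0("B", 1:4)
    data.frame(A = a_docs,
               best_B = b_docs[apply(simmat[a_docs, b_docs], 1, which.max)])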