
Text Similarity using PoS tag


I want to calculate text similarity using only the words that carry a specific POS tag. Currently I compute cosine similarity, but that does not take POS tags into account.

library(quanteda)  # provides corpus(), dfm(), docnames() and ndoc()

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "consultation of gynecologist",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "consultation (inspection) of gynecalogist",
  "MRI right leg arteries",
  "X-ray right leg arteries with special care"
), stringsAsFactors = F)

corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")

docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

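# Note: written against an older quanteda release. In newer releases the ngrams
# and selection arguments have changed (see tokens_ngrams() and the
# quanteda.textstats package, which now hosts textstat_simil()).
dtm3 <- rbind(dfm(corp1, ngrams=2), dfm(corp2, ngrams=2))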
cosines <- lapply(docnames(corp2), 
                  function(x) textstat_simil(dtm3[c(x, docnames(corp1)), ],
                                             method = "cosine",
                                             selection = x)[-1, , drop = FALSE])
do.call(cbind, cosines)

In the above example, "X-ray right leg arteries" should not be mapped to "MRI right leg arteries", as these are two different categories of service. Unfortunately, I don't have an explicit categorization of services; I only have the service text. Is it possible to use POS tagging to assign more importance to words such as "X-ray", "consultation", "leg" and "arteries"? The services in the code are just a sample. In reality, I have more than 10K services. I explored the udpipe package for POS tagging but didn't have much success.


Solution

  • To do POS tagging with udpipe, you can proceed as follows (based on your example data A and B).

    library(udpipe)
    library(magrittr)
    library(data.table)
    txt <- rbindlist(list(A = A, B = B), idcol = "dataset")
    txt$id <- sprintf("dataset%s_id%s", txt$dataset, seq_len(nrow(txt)))
    
    # Tag using udpipe version 0.6 on CRAN, which can show annotation progress (trace argument)
    udmodel <- udpipe_download_model("english")
    udmodel <- udpipe_load_model(udmodel$file_model)
    txt_anno <- udpipe_annotate(udmodel, x = txt$name, doc_id = txt$id, trace = 5)
    txt_anno <- as.data.table(txt_anno)
    

    If you want to calculate similarities based on a document-term matrix of the lemmas, do as follows (this uses sim2 from the text2vec R package).

    # construct DTM with only nouns based on lemmas
    dtm1 <- subset(txt_anno, upos %in% c("NOUN"), select = c("doc_id", "lemma")) %>% 
      document_term_frequencies %>% 
      document_term_matrix
    library(text2vec)
    sim2(dtm1, dtm1, method = "cosine")
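
    To make the cosine step concrete, here is a minimal base-R sketch of the computation on a noun-only matrix. The matrix dtm_toy and the helper cosine_sim are hypothetical and written only for illustration; the counts are made up, not udpipe output, and this is not a description of how sim2 is implemented internally.

```r
# Toy noun-only document-term matrix (hypothetical counts, for illustration only)
dtm_toy <- matrix(
  c(1, 1, 1, 0,   # A.1: x-ray, leg, artery
    0, 1, 1, 1,   # B.3: leg, artery, mri
    1, 1, 1, 0),  # B.1: x-ray, leg, artery
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A.1", "B.3", "B.1"), c("x-ray", "leg", "artery", "mri"))
)

# Cosine similarity between all pairs of rows:
# row dot products divided by the product of the row norms
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  tcrossprod(m) / outer(norms, norms)
}

sim_toy <- cosine_sim(dtm_toy)
round(sim_toy, 2)
```

    In this toy case A.1 and B.3 share two of their three nouns, so their similarity is 2/3 even though one is an X-ray and the other an MRI. Restricting the matrix to nouns therefore does not separate such categories by itself.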
    

    If you also want to include n-grams of nouns, do as follows. Extract nouns that follow one another, create a document-term matrix of these compound terms, and combine it with the existing document-term matrix so that document similarities are easy to compute.

    # Add n-grams of nouns (2 nouns following one another, optionally with punctuation in between)
    keyw <- txt_anno[, keywords_phrases(x = upos, term = lemma, pattern = "NOUN(PUNCT)*NOUN", is_regex = TRUE), by = "doc_id"]
    keyw <- keyw[, list(freq = .N), by = c("keyword", "ngram")]
    
    # add a new column of this n-gram and create DTM
    txt_anno <- txt_anno[, term := txt_recode_ngram(x = lemma, compound = keyw$keyword, ngram = keyw$ngram), by = "doc_id"]
    
    dtm2 <- subset(txt_anno, term %in% keyw$keyword, select = c("doc_id", "term")) %>% 
      document_term_frequencies %>% 
      document_term_matrix
    
    dtmcombined <- dtm_cbind(dtm1, dtm2)
    colnames(dtmcombined)
    sim2(dtmcombined, dtmcombined, method = "cosine")
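
    Once a similarity matrix is available, mapping each service in A to its closest service in B is a row-wise maximum. The sketch below uses a hypothetical matrix sim_ab with made-up values; with real output you would first keep the rows belonging to A and the columns belonging to B (for instance by the "datasetA"/"datasetB" prefix of the ids built above, which is an assumption about how you want to split them).

```r
# Hypothetical similarity matrix: rows = documents from A, columns = documents from B
sim_ab <- matrix(
  c(0.9, 0.1, 0.6,
    0.0, 0.8, 0.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("A.1", "A.2"), c("B.1", "B.2", "B.3"))
)

# For every row, take the column with the highest similarity
best <- data.frame(
  doc_a      = rownames(sim_ab),
  doc_b      = colnames(sim_ab)[max.col(sim_ab, ties.method = "first")],
  similarity = unname(apply(sim_ab, 1, max)),
  stringsAsFactors = FALSE
)
best
```

    A similarity threshold on the similarity column can then be used to reject weak matches instead of always accepting the best one.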