
Equivalent of Apache Lucene "proximity searches" in R


I'm working on a corpus of documents (clinical narratives from hospital stays), mainly using the quanteda package. The objective is to classify documents based on the presence/absence of a feature, say "spastic cough".

I would like to be able to reproduce the behaviour of an Apache Lucene "proximity search" (https://lucene.apache.org/core/8_11_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches) using R.

Let's take an example: "spastic and productive cough in a 91-year-old patient following femoral neck surgery"

I would begin tokenizing the phrase as follows:

library(quanteda)
library(magrittr)  # for the %>% pipe

toks <- tokens(
  c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
  remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, padding = TRUE
) %>%
  tokens_remove(pattern = stopwords("en", source = "nltk"))

which yields the following output:

Tokens consisting of 1 document.
text1 :
[1] "spastic"     "productive"  "cough"       "91-year-old" "patient"     "following"   "femoral"    
[8] "neck"        "surgery" 

I can then proceed to generate n-grams and skip-grams:

toks <- tokens_ngrams(toks, n = 4, skip = 0:3)

toks
[1] "spastic_productive_cough_91-year-old"     "spastic_productive_cough_patient"        
  [3] "spastic_productive_cough_following"       "spastic_productive_cough_femoral"        
  [5] "spastic_productive_91-year-old_patient"   "spastic_productive_91-year-old_following"
  [7] "spastic_productive_91-year-old_femoral"   "spastic_productive_91-year-old_neck"     
  [9] "spastic_productive_patient_following"     "spastic_productive_patient_femoral"      
 [11] "spastic_productive_patient_neck"          "spastic_productive_patient_surgery"      
 [13] "spastic_productive_following_femoral"     "spastic_productive_following_neck"       
 [15] "spastic_productive_following_surgery"     "spastic_cough_91-year-old_patient"       
 [17] "spastic_cough_91-year-old_following"      "spastic_cough_91-year-old_femoral"       
 [19] "spastic_cough_91-year-old_neck"           "spastic_cough_patient_following"         
 [21] "spastic_cough_patient_femoral"            "spastic_cough_patient_neck"              
 [23] "spastic_cough_patient_surgery"            "spastic_cough_following_femoral"         
 [25] "spastic_cough_following_neck"             "spastic_cough_following_surgery"         
 [27] "spastic_cough_femoral_neck"               "spastic_cough_femoral_surgery"           
 [29] "spastic_91-year-old_patient_following"    "spastic_91-year-old_patient_femoral"     
 [31] "spastic_91-year-old_patient_neck"         "spastic_91-year-old_patient_surgery"     
.........

At this point I guess I could simply do:

library(stringr)
any(str_detect(as.character(toks), "spastic_cough"))
[1] TRUE

but I'm not sure I'm using the correct approach, as it feels clunky compared to how a Lucene query would work. If I were trying to identify patients with "spastic cough" by querying the corpus with Apache Lucene, I might use something like "spastic cough"~3, where "~3" means that any of the skip-grams 0:3 would match.
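One alternative I've considered, to avoid enumerating every skip-gram, is quanteda's kwic() and checking whether the second term falls inside the context window of the first. A sketch (the window here counts tokens after stopword removal, so it only approximates Lucene's positional distance):

```r
library(quanteda)

toks <- tokens_remove(
  tokens(
    c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
    remove_punct = TRUE
  ),
  pattern = stopwords("en", source = "nltk")
)

# All occurrences of "spastic", with up to 3 tokens of context on each side
hits <- kwic(toks, pattern = "spastic", window = 3)

# Does "cough" fall inside that window? (rough analogue of "spastic cough"~3)
any(grepl("\\bcough\\b", paste(hits$pre, hits$post)))
```

Since kwic() also returns the document name for each hit, this could in principle be turned into a per-document presence/absence flag rather than a single corpus-wide TRUE/FALSE.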

Any input about how and where I could improve my method?

EDIT:

This may do the trick: https://search.r-project.org/CRAN/refmans/corpustools/html/search_features.html

but, at the moment, I can't figure out how to include it in the workflow.

EDIT 2:

It seems like I can query the corpus with subset_query using a Lucene-like syntax. The big problem I'm facing now is that corpustools doesn't accept a quanteda tokens object as input, and the function tokens_to_corpus() isn't working for me. This prevents me from controlling the tokenization process.


Solution

  • Actually, after delving deeper into the documentation, the corpustools package offers everything I need for an Apache Lucene-like experience in R =)
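For anyone landing here, the corpustools workflow looks roughly like this (a sketch using the package's default tokenization; the query syntax is the Lucene-like mini-language documented in ?search_features):

```r
library(corpustools)

# Build a tcorpus directly from the raw text
tc <- create_tcorpus(
  c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery")
)

# Proximity query: "spastic" and "cough" within 3 tokens of each other
hits <- search_features(tc, query = '"spastic cough"~3')
hits$hits  # matching tokens, with their doc_id

# Or keep only the documents that match the query
tc_match <- subset_query(tc, query = '"spastic cough"~3')
```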