I have 4 documents indexed by Elasticsearch (using the library elastic in R).
library(elastic)
connection <- connect(errors = "complete")
indexName <- "test"
index_create(connection, indexName)

text <- data.frame("full_text" = "this is a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "This is a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "THIS is a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "THIS a brown dog is. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "ia THIS a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that likse to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that like to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is apple THIS a brown dog. thats like to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. apple that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. apple that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog apple that likes to be.")
docs_bulk(connection, text, indexName)
By using the search query: 'dog MUST PRECEDE that, WITH A MAX_GAP OF 3, BUT MUST NOT INCLUDE .'
I want to find (return) only the last document.
I tried the following search query, but it does not work: all 12 documents are returned because, if I'm correct, periods are not indexed by Elasticsearch (the default standard analyzer drops punctuation).
query <- '{
  "from" : 0,
  "size" : 10000,
  "query": {
    "bool": {
      "must": [
        {
          "intervals" : {
            "full_text" : {
              "all_of" : {
                "ordered" : true,
                "intervals" : [
                  { "match" : { "query" : "dog", "max_gaps" : 0, "ordered" : true } },
                  {
                    "any_of" : {
                      "intervals" : [
                        { "match" : { "query" : "likes", "max_gaps" : 0, "ordered" : true } }
                      ]
                    }
                  }
                ],
                "max_gaps" : 3
              }
            }
          }
        }
      ],
      "must_not": { "match": { "full_text": "." } }
    }
  }
}'
Search(connection, indexName, body = query)
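The suspicion about periods can be checked roughly without a running cluster. The sketch below is only an approximation in base R of what the standard analyzer does (lowercase, drop punctuation, split on whitespace); the helper name approx_standard_tokens is made up for illustration, and the real tokenizer has more rules than this. It suggests that a query for "." analyzes to no tokens at all, so the must_not clause has nothing to match on and excludes no documents.

```r
# Rough stand-in for the standard analyzer (an assumption, not Elasticsearch itself):
# lowercase the text, replace punctuation with spaces, split on whitespace.
approx_standard_tokens <- function(text) {
  cleaned <- gsub("[[:punct:]]", " ", tolower(text))
  tokens <- strsplit(trimws(cleaned), "\\s+")[[1]]
  tokens[nzchar(tokens)]  # drop empty strings, if any
}

doc_tokens <- approx_standard_tokens("is THIS a brown dog. apple that likes to be.")
dot_tokens <- approx_standard_tokens(".")

print(doc_tokens)          # the period after "dog" is gone
print(length(dot_tokens))  # 0: the query text "." produces no tokens
```

With no token for the period in the index, neither the intervals rule nor the must_not clause can tell "dog. apple that likes" apart from "dog apple that likes".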
Next, I included the whitespace analyser before indexing the documents:
index_create(connection, indexName)
mapping <- '{
  "properties": {
    "full_text": { "type": "text", "analyzer": "whitespace" }
  }
}'
mapping_create(connection, indexName, body = mapping)
Interestingly, two documents are returned: the last one (correct), and this one: "THIS a brown dog is. that likes to be."
I assume that this document is returned because the . is part of the token "is." and not of "dog" or "that".
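That assumption can be illustrated with a small base R sketch that mimics whitespace tokenization and an ordered "dog ... likes" match with at most 3 tokens in between. The helpers whitespace_tokens and ordered_within are hypothetical names for illustration only; they emulate the idea, not Elasticsearch's actual intervals implementation.

```r
# Split on whitespace only, so punctuation stays glued to its word ("dog.", "is.").
whitespace_tokens <- function(text) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# TRUE if the exact token `first` occurs before the exact token `second`
# with at most `max_gaps` tokens between them.
ordered_within <- function(tokens, first, second, max_gaps) {
  first_pos <- which(tokens == first)
  second_pos <- which(tokens == second)
  if (length(first_pos) == 0 || length(second_pos) == 0) return(FALSE)
  gaps <- outer(second_pos, first_pos, "-") - 1  # tokens between each pair
  any(gaps >= 0 & gaps <= max_gaps)
}

docs <- c(
  "is THIS a brown dog. apple that likes to be.",  # token is "dog.", not "dog"
  "THIS a brown dog is. that likes to be.",        # bare "dog"; the period sits on "is."
  "is THIS a brown dog apple that likes to be."    # bare "dog", no period in between
)

matched <- vapply(
  docs,
  function(d) ordered_within(whitespace_tokens(d), "dog", "likes", 3),
  logical(1),
  USE.NAMES = FALSE
)
print(matched)  # expected: FALSE, TRUE, TRUE
```

In this emulation the second document matches for the reason suspected above: "dog" is a clean token, and the period only appears inside "is.", which the query never inspects.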
Any hints on how to proceed are very welcome. If I need to provide more info, just let me know. Thank you.
Thanks to input from several users on Reddit, I was able to solve the problem.
I will post a clean example here later.