
In Elasticsearch, how to exclude search results that 'cross' sentences (full stops)


I have a set of documents indexed in Elasticsearch, using the elastic library in R.

library(elastic)
connection <- connect(errors = "complete")
indexName <- "test"
index_create(connection, indexName)

text <- data.frame("full_text" = "this is a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "This is a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "THIS is a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "THIS a brown dog is. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "ia THIS a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that likse to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that like to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is apple THIS a brown dog. thats like to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. apple that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog. apple that likes to be.")
docs_bulk(connection, text, indexName)
text <- data.frame("full_text" = "is THIS a brown dog apple that likes to be.")
docs_bulk(connection, text, indexName)

Using a search query along the lines of 'dog MUST PRECEDE that, WITH A MAX_GAP OF 3, BUT MUST NOT INCLUDE .', I want to return only the last document.

I tried the following search query, but it does not work: all 12 documents are returned because, if I'm correct, periods are not indexed by Elasticsearch.

query <- '{
  "from": 0,
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "intervals": {
            "full_text": {
              "all_of": {
                "ordered": true,
                "intervals": [
                  { "match": { "query": "dog", "max_gaps": 0, "ordered": true } },
                  {
                    "any_of": {
                      "intervals": [
                        { "match": { "query": "likes", "max_gaps": 0, "ordered": true } }
                      ]
                    }
                  }
                ],
                "max_gaps": 3
              }
            }
          }
        }
      ],
      "must_not": {
        "match": {
          "full_text": "."
        }
      }
    }
  }
}'
Search(connection, indexName, body = query)

Next, I included the whitespace analyser before indexing the documents:

index_create(connection, indexName)
mapping <- '{
  "properties": {
    "full_text": { "type": "text", "analyzer": "whitespace" }
  }
}'
mapping_create(connection, indexName, body = mapping)

Interestingly, two documents are returned: the last one (correct), and this one "THIS a brown dog is. that likes to be."

I assume that this document is returned because the . is attached to "is." and not to "that".
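
A rough way to check this assumption without a running cluster is to split the texts on whitespace in base R. This only approximates the whitespace analyzer (which splits on whitespace and keeps case and punctuation); the helper name `whitespace_tokens` is made up for this sketch.

```r
# Approximation of the whitespace analyzer: split on runs of whitespace,
# leaving case and punctuation attached to each token.
whitespace_tokens <- function(text) {
  strsplit(text, "[[:space:]]+")[[1]]
}

returned     <- "THIS a brown dog is. that likes to be."
not_returned <- "is THIS a brown dog. that likes to be."

tokens_returned     <- whitespace_tokens(returned)
tokens_not_returned <- whitespace_tokens(not_returned)

print(tokens_returned)
print(tokens_not_returned)

# "dog" is a token of its own only in the first text;
# in the second text the token is "dog." (period attached).
print("dog" %in% tokens_returned)      # TRUE
print("dog" %in% tokens_not_returned)  # FALSE
```

If this approximation holds, the query term "dog" can only match documents where "dog" is not directly followed by a period, which would explain the two hits.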

Any hints on how to proceed are very welcome. If I need to provide more info, just let me know. Thank you.


Solution

  • Thanks to input from several users on Reddit, I was able to solve the problem.

    See: https://www.reddit.com/r/elasticsearch/comments/15z3g97/in_elasticsearch_how_to_exclude_search_results/?sort=new

    I will post a clean example here later.
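
The example from the thread is not reproduced here. As a hedged sketch of one possible approach (not necessarily the one suggested on Reddit): replace every period with a marker token at index time, then use the `filter` / `not_containing` option of the intervals query to reject matches that span the marker. The names `sentence_end`, `sentence_aware`, `sentenceboundary` and `run_sketch` are invented for illustration, and the sketch assumes Elasticsearch 7.x or later (typeless mappings, intervals filters). It has not been run against the documents above.

```r
# Assumption: a pattern_replace char filter turns each "." into a marker word,
# so sentence ends become searchable tokens.
settings <- '{
  "settings": {
    "analysis": {
      "char_filter": {
        "sentence_end": {
          "type": "pattern_replace",
          "pattern": "[.]",
          "replacement": " sentenceboundary "
        }
      },
      "analyzer": {
        "sentence_aware": {
          "type": "custom",
          "char_filter": ["sentence_end"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "full_text": { "type": "text", "analyzer": "sentence_aware" }
    }
  }
}'

# "dog" before "likes" with at most 3 tokens in between,
# excluding intervals that contain the sentence marker.
query <- '{
  "query": {
    "intervals": {
      "full_text": {
        "all_of": {
          "ordered": true,
          "max_gaps": 3,
          "intervals": [
            { "match": { "query": "dog" } },
            { "match": { "query": "likes" } }
          ],
          "filter": {
            "not_containing": { "match": { "query": "sentenceboundary" } }
          }
        }
      }
    }
  }
}'

# Hypothetical wrapper: creates the index, loads a data.frame of documents
# with a full_text column, and runs the query. Needs a live connection.
run_sketch <- function(connection, docs, indexName = "test") {
  elastic::index_create(connection, indexName, body = settings)
  elastic::docs_bulk(connection, docs, indexName)
  elastic::Search(connection, indexName, body = query)
}
```

With a connection from `connect()`, this would be called as `run_sketch(connection, docs)`. The intent is that "dog apple that likes" matches while "dog. that likes" does not, because the marker token falls inside the interval; newly indexed documents may need an index refresh before they show up in the search.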