
Return position and highlighting of search queries in Elasticsearch


I am using the official Elasticsearch-PHP client on a personal Debian server to index, search, and highlight individual documents. Each search returns only one document, which is then highlighted for "simple query string" searches. I am also using FVH (the fast vector highlighter).

My question is similar to this one: "Position as result, instead of highlighting". The test code is basically the same, so I won't repeat it here. However, in my case I need both position and highlighting. I followed the link to the documentation about term vectors, but just like the other OP, my searches are not always single exact words; in some cases they are phrases. How would I approach this?

My use case is to search only one document per query and present a summary of results with links that the user can click to jump to the specific place in the document where each result came from. If I have the index/position, I can simply use it against the full source of the document. I have checked the documentation to no avail.


Solution

  • You could try installing a plugin developed by the Wikimedia Foundation called Experimental Highlighter (available on GitHub).

    You can install it for Elasticsearch 7.5 as follows; for other Elasticsearch versions, please refer to the GitHub project page:

    ./bin/elasticsearch-plugin install org.wikimedia.search.highlighter:experimental-highlighter-elasticsearch-plugin:7.5.1
    

    Then restart Elasticsearch.

    If you also need to retrieve the positions, declare your field with the term_vector setting "with_positions_offsets_payloads" (see the Elasticsearch term vector documentation). If offsets can replace positions for your use case, skip to the next paragraph.

    PUT /my-index-000001
    {
      "mappings": {
        "properties": {
          "text": {
            "type": "text",
            "term_vector": "with_positions_offsets_payloads",
            "analyzer": "fulltext_analyzer"
          }
        }
      }
    }
    

    If you do not need the positions, it is faster and uses much less space to set the index option "offsets" instead (see the Elasticsearch documentation and the plugin documentation):

    PUT /my-index-000001
    {
      "mappings": {
        "properties": {
          "text": {
            "type": "text",
            "index_options": "offsets",
            "analyzer": "fulltext_analyzer"
          }
        }
      }
    }
    

    Then you can query with the experimental highlighter and return only the offsets of the highlighted parts:

    {
      "query": {
        "match": {
          "text": "hello world"
        }
      },
      "highlight": {
        "order": "score",
        "fields": {
          "text": {
            "number_of_fragments": 10,
            "fragment_size": 15,
            "type": "experimental",
            "options": {"return_offsets": true}
          }
        }
      }
    }
    
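    If you build the request programmatically, the same body can be assembled as a plain dictionary and serialized to JSON. The following is only a minimal sketch in Python (the question uses the PHP client, where the equivalent would be a nested array); the helper name `build_offset_query` and its parameters are hypothetical, and the option is spelled `return_offsets` as in the plugin documentation quoted in this answer.

```python
import json


def build_offset_query(field: str, text: str,
                       fragments: int = 10, fragment_size: int = 15) -> dict:
    """Build a search body that asks the experimental highlighter for offsets.

    Hypothetical helper: it only mirrors the JSON request shown above and
    does not talk to Elasticsearch itself.
    """
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "order": "score",
            "fields": {
                field: {
                    "number_of_fragments": fragments,
                    "fragment_size": fragment_size,
                    "type": "experimental",
                    # Return offsets instead of marked-up snippets.
                    "options": {"return_offsets": True},
                }
            },
        },
    }


body = build_offset_query("text", "hello world")
print(json.dumps(body, indent=2))
```

    The printed JSON can be sent as the body of a search request against your index with whichever client you use.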

    This way the highlighter returns no text, only the start and end offsets, which are numbers that mark where each match sits in the field. To retrieve the highlighted content, read the field from ['hits']['hits'][0]['_source']['text'] (text is your field name) and extract the substring between the start offset and the end offset. Make sure you treat the string with the correct encoding (UTF-8) and use character-aware string functions, otherwise the offsets will not match the text. According to the doc:

    The return_offsets option changes the results from a highlighted string to the offsets in the highlighted that would have been highlighted. This is useful if you need to do client side sanity checking on the highlighting. Instead of a marked up snippet you'll get a result like 0:0-5,18-22:22. The outer numbers are the start and end offset of the snippet. The pairs of numbers separated by the ,s are the hits. The number before the - is the start offset and the number after the - is the end offset. Multi-valued fields have a single character worth of offset between them.
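    To make the format concrete, here is a small sketch that parses one such offset string and slices the matching text out of the source field. It is written in Python for illustration only; the names `Snippet`, `parse_offsets` and `extract_hits` are hypothetical, and it assumes that end offsets are exclusive and that offsets count characters of the decoded string (text containing characters outside the Basic Multilingual Plane may need extra care).

```python
from typing import List, NamedTuple, Tuple


class Snippet(NamedTuple):
    start: int                    # start offset of the whole snippet
    end: int                      # end offset of the whole snippet
    hits: List[Tuple[int, int]]   # (start, end) offset pairs of each hit


def parse_offsets(encoded: str) -> Snippet:
    """Parse a string such as '0:0-5,18-22:22' into its numeric parts."""
    start_str, hits_str, end_str = encoded.split(":")
    hits = []
    for pair in hits_str.split(","):
        if not pair:              # tolerate a snippet without hits
            continue
        hit_start, hit_end = pair.split("-")
        hits.append((int(hit_start), int(hit_end)))
    return Snippet(int(start_str), int(end_str), hits)


def extract_hits(source_text: str, encoded: str) -> List[str]:
    """Slice each hit out of the source text (end offset assumed exclusive)."""
    return [source_text[s:e] for s, e in parse_offsets(encoded).hits]


# Made-up document text and offsets, standing in for
# response['hits']['hits'][0]['_source']['text'] and one highlight entry.
text = "hello there, dear world"
print(parse_offsets("0:0-5,18-22:22"))
print(extract_hits(text, "0:0-5,18-23:23"))
```

    The snippet start and end offsets can be used the same way to show some context around the hits, or to build the links into the document that the question describes.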

    Let me know if that plugin could help!