Retrieving Token Payloads During Search

I have an index defined like the following, which uses the delimited payload token filter and stores payloads along with tokens:

PUT text_payloads
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_payloads",
        "analyzer": "payload_delimiter"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "payload_delimiter": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}

Given a document in that index such as this one:

POST text_payloads/_doc/1
{
  "text": "the|0 brown|3 fox|4 is|0 quick|10"
}

I can get the payloads using the _termvectors API:

GET text_payloads/_termvectors/1
{
  "fields": [ "text" ],
  "payloads": true
}

This returns the following result:

{
  "_index": "text_payloads",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 5,
        "doc_count": 1,
        "sum_ttf": 5
      },
      "terms": {
        "brown": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "payload": "QEAAAA=="
            }
          ]
        },
        "fox": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "payload": "QIAAAA=="
            }
          ]
        },
        "is": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "payload": "AAAAAA=="
            }
          ]
        },
        "quick": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "payload": "QSAAAA=="
            }
          ]
        },
        "the": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "payload": "AAAAAA=="
            }
          ]
        }
      }
    }
  }
}

If I use the _search endpoint instead, with a match_phrase query and a highlighter:

POST text_payloads/_search
{
  "query": {
    "match_phrase": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "pre_tags": ["<mark>"],
    "post_tags": ["</mark>"],
    "encoder": "html",
    "fields": {
      "text": {}
    }
  }
}

I get the following result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "text_payloads",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "text": "the|0 brown|3 fox|4 is|0 quick|10"
        },
        "highlight": {
          "text": [
            "the|0 <mark>brown|3</mark> <mark>fox|4</mark> is|0 quick|10"
          ]
        }
      }
    ]
  }
}

This works, and I can parse the payloads out of the highlighted search result, but I'd rather Elasticsearch give me the payloads in some sort of structured format. It would be even better if Elasticsearch highlighted the plain text only and left the payloads out of the highlighted result.

Is anything like this possible with Elasticsearch? Or should I stick to parsing the result with the payloads embedded?

Solution

"... but I'd rather Elasticsearch give me the payloads in some sort of structured format."

The term vectors and multi term vectors APIs are the way to go here.
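As a rough client-side sketch of what that could look like, the snippet below turns a _termvectors response into (position, term, payload) rows. It assumes the payloads are 4-byte big-endian floats, which is consistent with the values in the response above (for example, "QEAAAA==" decodes to 3.0 for brown|3). The function names `decode_float_payload` and `extract_payloads` are made up for this example, and the response dict is a trimmed copy of the one in the question.

```python
import base64
import struct


def decode_float_payload(payload_b64: str) -> float:
    """Decode one base64 payload, assumed to be a 4-byte big-endian float."""
    raw = base64.b64decode(payload_b64)
    return struct.unpack(">f", raw)[0]


def extract_payloads(termvectors_response: dict, field: str) -> list:
    """Return (position, term, payload) tuples sorted by token position."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    rows = []
    for term, info in terms.items():
        for token in info["tokens"]:
            rows.append(
                (token["position"], term, decode_float_payload(token["payload"]))
            )
    return sorted(rows)


# Trimmed copy of the _termvectors response from the question.
response = {
    "term_vectors": {
        "text": {
            "terms": {
                "brown": {"term_freq": 1, "tokens": [{"position": 1, "payload": "QEAAAA=="}]},
                "fox": {"term_freq": 1, "tokens": [{"position": 2, "payload": "QIAAAA=="}]},
                "is": {"term_freq": 1, "tokens": [{"position": 3, "payload": "AAAAAA=="}]},
                "quick": {"term_freq": 1, "tokens": [{"position": 4, "payload": "QSAAAA=="}]},
                "the": {"term_freq": 1, "tokens": [{"position": 0, "payload": "AAAAAA=="}]},
            }
        }
    }
}

for position, term, payload in extract_payloads(response, "text"):
    print(position, term, payload)
```

If the index used a different payload encoding, the `struct` format string would need to change accordingly.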

"Is anything like this possible with Elasticsearch?"

Not at the moment, unless you want to implement it as a plugin.

"Or should I stick to parsing the result with the payloads embedded?"

I think that's the simplest way to deal with it. There is not much support for handling payloads outside of plugins. The highlighter is also oblivious to the delimited payload format: to the highlighter, brown|3 is just a piece of input text that gets indexed as brown. It finds brown and highlights the corresponding text based on the stored positions or additional analysis, which means it highlights brown|3 as a whole. If you don't want to strip the payloads out of the highlighted text yourself, you need to index this field twice, once with payloads and once without, and highlight the version without payloads.
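For the "parse it yourself" route, a minimal sketch follows. It assumes the <mark> tags and the term|payload format from the question, numeric payloads, whitespace-separated tokens, and that each highlighted token is wrapped in its own pair of tags (as in the result shown in the question). `parse_fragment` is a hypothetical helper name, not part of any Elasticsearch client.

```python
import re

# One whitespace-separated word: optional <mark>, term, "|", payload, optional </mark>.
TOKEN_RE = re.compile(r"(<mark>)?([^\s|<]+)\|([^\s<]+)(</mark>)?")


def parse_fragment(fragment: str):
    """Split a highlighted fragment into payload-free text and structured tokens."""
    tokens = []
    plain_parts = []
    for word in fragment.split():
        match = TOKEN_RE.fullmatch(word)
        if match is None:
            # Not in term|payload form: keep the word unchanged.
            plain_parts.append(word)
            continue
        opened, term, payload, _closed = match.groups()
        tokens.append(
            {"term": term, "payload": float(payload), "highlighted": bool(opened)}
        )
        # Re-wrap highlighted terms so the markup survives without the payload.
        plain_parts.append(f"<mark>{term}</mark>" if opened else term)
    return " ".join(plain_parts), tokens


fragment = "the|0 <mark>brown|3</mark> <mark>fox|4</mark> is|0 quick|10"
plain, tokens = parse_fragment(fragment)
print(plain)
for token in tokens:
    print(token)
```

The same stripping step could be applied to the source text before indexing, to produce the payload-free copy of the field mentioned above.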

Using the information discovered by the highlighter to find the corresponding parts of the term vectors is also tricky. Internally, the highlighter knows the exact location of the original token, but it doesn't return this information to the client, only the result of applying it to the original string. There is an almost decade-old issue in which users ask for this to be exposed.