
Searching transcripts with associated timestamps in Elasticsearch


I have documents which look like this:

{
  "title": "Hello, World!",
  "segments": [
    {
      "text": "waive such protections",
      "start": 0,
      "end": 7040
    },
    {
      "text": "in all contexts",
      "start": 7040,
      "end": 8500
    }
  ]
}

This example represents the end of a complete sentence: "waive such protections in all contexts." The problem I'm running into is that I can do a phrase search for "waive such protections" or "in all contexts", but I cannot do a phrase search for "waive such protections in all contexts".

Is there a way I can do a phrase search and allow the phrase search to span multiple adjacent entries in the segments array?

Alternatively, is there a better way I can structure my documents? My goal is to be able to search for phrases and return the timestamp of where that phrase was found in a transcript (these documents are generated from vtt files, but I can change them into any shape I need).

Right now I am doing multiple phrase searches over all sequential subsets of the query: essentially a bool query with minimum_should_match set to 1, and should set to phrase searches for "waive", "waive such", "waive such protections", "waive such protections in", etc. The query "waive such protections in all contexts" actually becomes 35 queries. This works, but is very slow.
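For reference, this brute-force approach can be sketched in Python as follows. This is only an illustration of the shape of the query, not the exact code in use; the field name segments.text and the clause structure are assumptions, and the number of sub-phrases grows quadratically with the length of the search phrase.

```python
def contiguous_subphrases(phrase):
    """Every contiguous run of words in the phrase."""
    words = phrase.split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]

def build_bool_query(phrase):
    """One bool query whose should clauses are phrase searches for every sub-phrase."""
    return {
        "query": {
            "bool": {
                "minimum_should_match": 1,
                "should": [
                    {"match_phrase": {"segments.text": p}}
                    for p in contiguous_subphrases(phrase)
                ],
            }
        }
    }
```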


Solution

  • Is there a way I can do a phrase search and allow the phrase search to span multiple adjacent entries in the segments array?

    Unfortunately not. Elasticsearch first flattens those arrays (unless they are mapped as nested) into a single document, which is then converted into a list of unique words. At that point all information about neighboring segments is lost. The Elasticsearch documentation goes into more detail:

    When a document is stored, it is indexed and fully searchable in near real-time--within 1 second. Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.

    As you pointed out, you could laboriously build a series of queries covering every possible way the phrase could be split across segments. I would have suggested using the Regexp query to at least ensure the words in the field were positioned at the start or end, but anchor operators are not supported:

    Lucene’s regular expression engine does not support anchor operators, such as ^ (beginning of line) or $ (end of line).

    Option 1

    One idea that could improve things is to add a field that concatenates each segment's text with the text of the segment(s) that follow it:

    "segments": [
        {
          "text": "waive such protections",
          "text_follows": "waive such protections in all contexts",
          "start": 0,
          "end": 7040
        },
        {
          "text": "in all contexts",
          "text_follows": "",
          "start": 7040,
          "end": 8500
        }
    ]
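A small helper could build this text_follows field when generating the documents from the vtt files. This is a hypothetical sketch; the window parameter (how many consecutive segments to concatenate, here 2) is an assumption, and it follows the example above in giving the last segment an empty text_follows.

```python
def add_text_follows(segments, window=2):
    """Concatenate each segment's text with the text of the next (window - 1)
    segments; the last segment gets an empty text_follows."""
    out = []
    for i, seg in enumerate(segments):
        if i + 1 < len(segments):
            tf = " ".join(s["text"] for s in segments[i:i + window])
        else:
            tf = ""
        out.append({**seg, "text_follows": tf})
    return out
```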
    

    You can then perform a simple match phrase query, removing the need for those 35 queries.

    "query": {
      "match_phrase": {
        "segments.text_follows": "waive such protections in all contexts"
      }
    }
    

    This does increase the size of the index, and the maximum length of a phrase that can match is fixed at index creation time, by how many following segments you concatenate into text_follows.

    Another thing to watch out for is that even though only one array item matched, Elasticsearch will still return the entire array, which can quickly inflate the size of the HTTP response. To get around this, consider using the nested field type, which indexes each item of the array as a separate hidden document.

    PUT my-index
    {
      "mappings": {
        "properties": {
          "segments": {
            "type": "nested"
          }
        }
      }
    }
    

    These documents can then be retrieved with inner hits:

    GET my-index/_search
    {
      "_source": false,
      "query": {
        "nested": {
          "path": "segments",
          "query": {
            "match_phrase": {
              "segments.text_follows": "such protections in"
            }
          },
          "inner_hits": {}
        }
      }
    }
    

    You will then need to remove any results on the client side where the original text field neither contains the searched phrase nor ends with the beginning of it.

    ...
        "hits": [
          {
            "_index": "my-index",
            "_id": "1",
            "_nested": {
              "field": "segments",
              "offset": 0
            },
            "_score": 0.8630463,
            "_source": {
              "text": "waive such protections",
              "text_follows": "waive such protections in all contexts",
              "start": 0,
              "end": 7040
            }
          }
        ]
    ...
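That client-side filtering step might look like the sketch below. It assumes simple whitespace word matching (the real comparison should mirror whatever analysis you apply), and keeps a hit if the segment's own text either contains the whole phrase or ends with a prefix of it.

```python
def phrase_starts_in_segment(text, phrase):
    """True if the segment's own text contains the whole phrase,
    or ends with a non-empty prefix of the phrase (the match then
    continues into text_follows)."""
    words_t = text.lower().split()
    words_p = phrase.lower().split()
    # whole phrase inside the segment's own text
    for i in range(len(words_t) - len(words_p) + 1):
        if words_t[i:i + len(words_p)] == words_p:
            return True
    # a prefix of the phrase is a suffix of the segment's text
    for k in range(min(len(words_t), len(words_p)), 0, -1):
        if words_t[-k:] == words_p[:k]:
            return True
    return False
```

For the example response above, the first segment ("waive such protections") is kept for the query "such protections in", while a spurious inner hit on "in all contexts" would be dropped.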
    

    Option 2

    However, the previous option doesn't scale well for longer or more complex match_phrase queries, especially if you want to allow users to create their own search terms using the simple query string query.

    To get around that, you can concatenate all of the text data into one field in one document and encode the timestamp directly beside each word. A pattern replace character filter then strips those timestamps at analysis time, so they are ignored when searching.
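Producing that encoded field from the original segments could look like this (a sketch; the word[start] encoding matches the indexing example below, and whitespace tokenization of each segment's text is an assumption):

```python
def encode_segments(segments):
    """Flatten segments into one string, tagging each word
    with the start timestamp of its segment."""
    words = []
    for seg in segments:
        for w in seg["text"].split():
            words.append(f"{w}[{seg['start']}]")
    return " ".join(words)
```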

    PUT my-index
    {
      "mappings": {
        "properties": {
          "title": {"type": "text"},
          "text": {"type": "text", "analyzer": "my_analyzer"}
        }
      },
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "\\[[0-9]+\\]",
              "replacement": ""
            }
          }
        }
      }
    }
    
    PUT my-index/_doc/1
    {
      "title": "Hello, World!",
      "text": "waive[0] such[0] protections[0] in[7040] all[7040] contexts[7040]"
    }
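To see what the char_filter does before tokenization, here is the same pattern_replace regex applied in Python; the analyzed text that actually gets indexed is the plain sentence with every [timestamp] tag removed.

```python
import re

# The same pattern as the pattern_replace char_filter above.
TIMESTAMP = re.compile(r"\[[0-9]+\]")

def strip_timestamps(text):
    """Remove every [digits] tag, leaving only the searchable words."""
    return TIMESTAMP.sub("", text)
```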
    

    You will then need to rely on the built-in highlighter to extract only the relevant part of the text.

    GET my-index/_search
    {
      "_source": false,
      "query": {
        "match_phrase": {
          "text": "such protections in"
        }
      },
      "highlight": {
        "fields": {
          "text": {}
        }
      }
    }
    

    The hits will then need to be processed on the client side to extract the timestamp data:

    ...
        "hits": [
          {
            "_index": "my-index",
            "_id": "1",
            "_score": 0.8630463,
            "highlight": {
              "text": [
                "waive[0] <em>such[0] protections[0] in[7040]</em> all[7040] contexts[7040]"
              ]
            }
          }
        ]
    ...
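Extracting the timestamps from such a fragment could be done with a couple of regexes. This sketch assumes the fragment is shaped like the example above, with the matched words wrapped in a single <em> span and each word carrying a [start] tag.

```python
import re

def timestamps_from_highlight(fragment):
    """Return (word, start_ms) pairs for the highlighted (matched) words only."""
    pairs = []
    for em in re.findall(r"<em>(.*?)</em>", fragment):
        for word, ts in re.findall(r"([A-Za-z']+)\[([0-9]+)\]", em):
            pairs.append((word, int(ts)))
    return pairs
```

The start time of the match is then simply the timestamp of the first pair.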
    

    If you are not worried about space, you could also include the end timestamps (e.g. waive[0-7040]) and adjust the regex in the char_filter accordingly.

    Hope that helps!