We are working with the Elasticsearch percolator.
We are trying to show all of the highlighted items in a single text instead of getting many separate results. As far as we know, this is not possible with the current Elasticsearch version. We found that it may be achievable with an upgraded version of Lucene, which supports a unified highlighted result, but we have no time for that.
We need quick and easy ideas to solve this problem. We found that it can be done by adding the respective HTML decorations afterwards, but we are also thinking about listing each word for each result and then using that list to find all the items in the original text, so that the results are ordered by their position of appearance.
The question is: which is the correct and easiest way to unify all Elasticsearch highlighting results into a single consolidated result?
Thank you
After finishing the small project, we opted for a Python approach.
The main problem is the way in which Elasticsearch delivers the highlighted results: it is designed for search engines, so it returns fragments of the text around each highlighted match instead of delivering the complete text.
For this reason we chose to highlight the results by post-processing instead of using the Elasticsearch highlighter: we obtain the search results, process them in Python, and finally deliver the complete text with the highlighted words.
First, the function that gets the search results:
from elasticsearch_dsl import Search

def get_response(client, index, query):
    # Percolate the given text against the queries stored in the 'query' field.
    s = Search().using(client).index(index).query("percolate", field='query', document={'title': query})
    response = s.execute()
    # get all matches: s.scan() https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html#pagination
    return response
Percolate is a query class not present in the current elasticsearch-dsl-py release, so for now we implement it ourselves:
from elasticsearch_dsl.query import Query

class Percolate(Query):
    name = 'percolate'
Second, we get all terms and their document ids:
from collections import defaultdict

def get_highlighted_term(response):
    # Maps each matched term or phrase to the ids of the documents it links to.
    dic_results = defaultdict(list)
    for hit in response:
        for query in hit.query:
            if query == 'span_term':
                dic_results[hit.query.span_term.title].append(hit.doc_id)
            elif query == 'span_near':
                # Rebuild the phrase from its individual span_term clauses.
                phrase = ' '.join(clause.span_term.title for clause in hit.query.span_near.clauses)
                dic_results[phrase].append(hit.doc_id)
    return dic_results
We use a dictionary for its versatility: the title/term is the key, and the list of document ids is its value; this makes it easier to obtain the corresponding values during text highlighting.
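To make the shape of that dictionary concrete, here is a minimal sketch with made-up terms and document ids. The `pairs` list is purely illustrative and only stands in for what the percolator hits would yield; it is not taken from a real index.

```python
from collections import defaultdict

# Hypothetical (term, doc_id) pairs standing in for the percolator hits.
pairs = [("python", "1"), ("elastic search", "2"), ("python", "3")]

dic_results = defaultdict(list)
for term, doc_id in pairs:
    dic_results[term].append(doc_id)

# Terms matched by several stored queries keep every document id, in hit order.
shared_terms = {term: ids for term, ids in dic_results.items() if len(ids) > 1}
```

Here "python" ends up with two ids, which is the case the highlighting step has to treat differently from a term with a single document.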
Finally, we get the result text:
import re

def get_highlighted_text(dic_results, text):
    for term, doc_ids in dic_results.items():
        insensitive_term = re.compile(re.escape(term), re.IGNORECASE)
        if len(doc_ids) > 1:
            # Several documents share this term: replace it with a list of links.
            result_text = "<ul id='multiple-links'>"
            for doc_id in doc_ids:
                result_text += "<li><a href='http://localhost/{0}'>{1}</a></li>".format(doc_id, term)
            result_text += "</ul>"
            # A callable keeps backslashes in the term from being read as escapes.
            text = insensitive_term.sub(lambda match: result_text, text)
        else:
            # Raw string so that \g<0> (the matched text) reaches re.sub unchanged.
            text = insensitive_term.sub(r'<a href="http://localhost/{}">\g<0></a>'.format(doc_ids[0]), text)
    return text
For now, when a term is shared by several documents, we render its document ids as a dropdown list. We also use a regex for the replacement.
That is our approach; you can find the full project code here.
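One caveat of replacing term by term is that a later term can match inside markup inserted for an earlier one (a term such as "localhost" would hit the generated links). A possible variation, sketched here and not part of our project, is to combine all terms into one pattern and substitute in a single pass, so matches are handled in their order of appearance and the inserted HTML is never rescanned. The name `highlight_single_pass` and the `base_url` parameter are hypothetical, and unlike the function above this sketch keeps the matched text inside every link.

```python
import re

def highlight_single_pass(dic_results, text, base_url="http://localhost/"):
    """Link every known term in one left-to-right pass over the text."""
    # Longest terms first, so a phrase wins over a single word it contains.
    entries = sorted(
        ((term, ids) for term, ids in dic_results.items() if term),
        key=lambda entry: len(entry[0]),
        reverse=True,
    )
    if not entries:
        return text
    # One capturing group per term; match.lastindex tells us which term matched.
    pattern = re.compile(
        "|".join("({})".format(re.escape(term)) for term, _ in entries),
        re.IGNORECASE,
    )

    def replace(match):
        found = match.group(0)
        doc_ids = entries[match.lastindex - 1][1]
        links = ["<a href='{}{}'>{}</a>".format(base_url, doc_id, found) for doc_id in doc_ids]
        if len(links) == 1:
            return links[0]
        items = "".join("<li>{}</li>".format(link) for link in links)
        return "<ul id='multiple-links'>{}</ul>".format(items)

    return pattern.sub(replace, text)
```

Called with the dictionary from the previous step and the original text, it returns the complete text with each term linked where it appears.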