java, solr, escaping, lucene-highlighter

Highlighting fields containing HTML


I have a field that might contain HTML code entered by users. If I use the simple highlighter, it does not escape the input before adding the <em> tags. E.g. if the input is

"This is a <caption>"

and I search for "caption", I get:

"This is a <<em>caption</em>>"

But I want to get:

"This is a &lt;<em>caption</em>&gt;"

When rendered as HTML, this looks the same as the input, with the matched word highlighted.


Solution

  • One technique is to use some other sentinel string to indicate highlighting. See hl.simple.pre and hl.simple.post. That way you can perform escaping first, without losing your highlighting, and then replace the sentinels with highlighting markup as a final step.

    For example, the Sunspot Solr client for Ruby uses @@@hl@@@ for the hl.simple.pre param, and @@@endhl@@@ for the hl.simple.post param. Using these values…

    • Solr returns: This is a <@@@hl@@@caption@@@endhl@@@>
    • HTML escaping: This is a &lt;@@@hl@@@caption@@@endhl@@@&gt;
    • Replace the sentinels: This is a &lt;<em>caption</em>&gt;
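The three steps above can be sketched in plain Java, which the question is tagged with. This is only an illustration under a few assumptions: the class and method names (HighlightEscaper, escapeHtml, render) are made up for this example, the sentinel values are the Sunspot ones quoted above, and the escaping is a small hand-rolled helper rather than any particular library's API.

```java
// A minimal sketch, not a definitive implementation.
// Assumes Solr was queried with hl.simple.pre=@@@hl@@@ and hl.simple.post=@@@endhl@@@.
class HighlightEscaper {
    // Hypothetical constants; they must match the values sent to Solr.
    static final String PRE = "@@@hl@@@";
    static final String POST = "@@@endhl@@@";

    // Escapes the characters that are significant in HTML text and attributes.
    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Escapes first, then swaps the sentinels for real markup.
    // The sentinels contain no HTML-special characters, so they survive escaping.
    static String render(String snippet) {
        return escapeHtml(snippet)
                .replace(PRE, "<em>")
                .replace(POST, "</em>");
    }

    public static void main(String[] args) {
        // prints: This is a &lt;<em>caption</em>&gt;
        System.out.println(render("This is a <@@@hl@@@caption@@@endhl@@@>"));
    }
}
```

The order matters: escaping after the replacement would turn the <em> tags themselves into &lt;em&gt;. Note that if the indexed text can itself contain the sentinel string, it will also be turned into markup, so pick sentinels that are unlikely to appear in user input.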