php html search escaping htmlspecialchars

PHP: highlight search query and escape HTML special characters


I'm trying to write a search function that highlights the search query in the results. At the moment I'm using this code for highlighting, which works fine:

    $hightlight = preg_replace('/'.strtolower($query).'/', '<span class=hightlight>'.strtolower($query).'</span>', strtolower($text));

The text I'm searching in is a string from a database. The problem is that if the text contains HTML special characters, for example <test>, and the user searches for <te, I get the following result:

    <span class="hightlight"><te< span="">st&gt;</te<></span>

which is rendered as st>. This makes sense, but it isn't what I want. I want <test> as the result, with <te highlighted.

So I need to escape the special characters. I know there is the function htmlspecialchars, but how can I use it in this case? Or is there another function? I can't escape before searching, because then I'm also searching inside the HTML entities. I also can't escape after searching, because then the <span> tags are already in the text and would be converted to HTML entities as well.

I hope my problem is clear. Does anyone have a solution for this?


Solution

  • Using a combination of htmlspecialchars() and a regex negative lookahead, I think we're able to solve this.

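    <?php
    $text = "this is just my really basic <test> of words";
    $query = "<te";
    
    // Escape both strings so the search runs on the same representation that gets output.
    $text = htmlspecialchars($text);
    $query = htmlspecialchars($query);
    
    // preg_quote() keeps regex metacharacters in the query literal;
    // the negative lookahead skips matches that sit inside an entity.
    $pattern = '/'.preg_quote(strtolower($query), '/').'(?![^\&]*\;)/';
    $highlight = preg_replace($pattern, '<span class=highlight>'.strtolower($query).'</span>', strtolower($text));
    
    echo $highlight;
    ?>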
    

    (small note, I took the liberty of changing hightlight to highlight)

    The part of this that solves the issue mentioned in your comment is the negative lookahead: (?![^\&]*\;)

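    It rejects any match that is followed by a ; with no & in between. In practice that means the query is not matched inside an HTML entity, i.e. between & and ;.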

    Now, this can obviously run into issues in edge cases where & and ; are part of the actual text: a genuine match that is followed by a literal ; (with no & in between) gets skipped. If you're not doing any sort of text and query limitation/sanitization, I'm not sure that there's anything that will work for all possible cases.
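
The lookahead behavior and its edge case can be sketched as follows. Both function names are hypothetical. `highlight_escaped` wraps the pattern from the answer (with `preg_quote()` and a callback added as assumptions). `highlight_raw` is a different technique, not the answer's method: it matches on the raw text and escapes each piece only while assembling the output. Its case-insensitive `i` flag, which keeps the original casing instead of lowercasing everything, is an assumption about the desired behavior.

```php
<?php
// Hypothetical helper wrapping the lookahead pattern from the answer above.
function highlight_escaped(string $text, string $query): string
{
    $text  = strtolower(htmlspecialchars($text));
    $query = strtolower(htmlspecialchars($query));

    // Reject a match when a ';' follows with no '&' in between,
    // i.e. when the match appears to sit inside an entity such as "&lt;".
    $pattern = '/' . preg_quote($query, '/') . '(?![^\&]*\;)/';

    // A callback avoids "$1"-style references in the query being
    // interpreted inside the replacement string.
    return preg_replace_callback($pattern, function (array $m): string {
        return '<span class=highlight>' . $m[0] . '</span>';
    }, $text);
}

// Hypothetical alternative: match on the raw text, escape while assembling.
function highlight_raw(string $text, string $query): string
{
    if ($query === '') {
        return htmlspecialchars($text);
    }

    // The capture group makes preg_split() return the matches as well,
    // so odd-indexed pieces are matches and even-indexed pieces are not.
    $pattern = '/(' . preg_quote($query, '/') . ')/i';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $escaped = htmlspecialchars($part);
        $out .= ($i % 2 === 1)
            ? '<span class=highlight>' . $escaped . '</span>'
            : $escaped;
    }
    return $out;
}

$text = 'Tom & Jerry; a <test>';

echo highlight_escaped($text, 'lt'), "\n";    // no match inside "&lt;"
echo highlight_escaped($text, 'jerry'), "\n"; // missed: a literal ';' follows
echo highlight_raw($text, 'jerry'), "\n";     // found, original casing kept
echo highlight_raw($text, '<te'), "\n";       // "<te" highlighted, rest escaped
```

Because `highlight_raw` never searches the escaped string, entities cannot produce false matches and a literal ; in the text cannot hide real ones. The trade-off is that the query is treated as a plain substring of the raw text, so it may not fit if the stored text already contains HTML entities.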