
Should I be using htmlspecialchars?


I seem to have trouble understanding when to use htmlspecialchars().

Let's say I do the following when I am inserting data:

$_POST = filter_input_array(INPUT_POST, [
    'name' => FILTER_SANITIZE_STRING,
    'homepage' => FILTER_DEFAULT // do nothing
]);

$course = new Course();
$course->name = trim($_POST['name']);
$course->homepage = $_POST['homepage']; // may contain unsafe HTML

$courseDAO = DAOFactory::getCourseDAO();
$courseDAO->addCourse($course);  // simple insert statement

When I output, I do the following:

$courseDAO = DAOFactory::getCourseDAO();
$course = $courseDAO->getCourseById($_GET['id']);
?>

<?php ob_start() ?>

<h1><?= $course->name ?></h1>
<div class="homepage"><?= $course->homepage ?></div>

<?php $content = ob_get_clean() ?>

<?php include 'layout.php' ?>

I would like $course->homepage to be treated and rendered as HTML by the browser.

I've been reading answers on this question. Should I be using htmlspecialchars() anywhere here?


Solution

  • There are (from a security POV) three types of data that you might output into HTML:

    • Text
    • Trusted HTML
    • Untrusted HTML

    (Note that HTML attributes and certain elements are special cases. For example, an onclick attribute expects HTML-encoded JavaScript, so your data needs to be both JS safe and HTML safe.)

    If it is text, then use htmlspecialchars to convert it to HTML.
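
    As a rough illustration of the text case, the sketch below wraps htmlspecialchars in a small helper. The helper name e() and the sample string are assumptions made for this example, and the flags shown are one common choice rather than the only valid one.

    ```php
    <?php
    // Hypothetical helper (not part of the question's code): escape plain text
    // so it can be placed in an HTML text node or a quoted attribute value.
    function e(string $text): string
    {
        // ENT_QUOTES escapes both double and single quotes;
        // ENT_SUBSTITUTE replaces invalid byte sequences instead of returning ''.
        return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    }

    // Sample value invented for the example.
    $name = '<script>alert(1)</script> & "friends"';

    echo '<h1>' . e($name) . '</h1>';
    // prints: <h1>&lt;script&gt;alert(1)&lt;/script&gt; &amp; &quot;friends&quot;</h1>
    ```

    The browser then displays the value as literal text instead of interpreting it as markup.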

    If it is trusted HTML, then just output it.

    If it is untrusted HTML then you need to sanitise it to make it safe. That generally means parsing it with a DOM parser, and then removing all elements and attributes that do not appear on a whitelist as safe (some attributes may be special cased to be filtered rather than stripped), and then converting the DOM back to HTML. Tools like HTML Purifier exist to do this.
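
    A minimal sketch of that sanitising step with HTML Purifier might look like the following. It assumes the library was installed with Composer (package ezyang/htmlpurifier); the whitelist and the sample input are invented for illustration and would need to be adapted to the markup you actually want to allow.

    ```php
    <?php
    // Assumption: HTML Purifier is available through Composer's autoloader.
    require_once __DIR__ . '/vendor/autoload.php';

    $config = HTMLPurifier_Config::createDefault();
    // Example whitelist only: elements and attributes not listed here are removed.
    $config->set('HTML.Allowed', 'p,br,strong,em,ul,ol,li,a[href]');

    $purifier = new HTMLPurifier($config);

    // Hypothetical untrusted input with a script element and an inline event handler.
    $dirty = '<p onclick="steal()">Welcome to <strong>the course</strong></p>'
           . '<script>alert(1)</script>';

    // purify() parses the input, drops everything outside the whitelist,
    // and serialises the result back to an HTML string.
    $clean = $purifier->purify($dirty);

    echo $clean;
    ```

    With this configuration the script element and the onclick attribute should be stripped, while the paragraph and strong elements are kept.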

    $course->homepage = $_POST['homepage']; // may contain unsafe HTML

    I would like $course->homepage to be treated and rendered as HTML by the browser.

    Then you have the third case: the homepage is untrusted HTML, so you need to filter it before you output it.
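
    Applied to the template in the question, that could be sketched roughly as follows. The function renderCourse() and its $sanitizeHtml parameter are hypothetical names introduced for this example; $sanitizeHtml stands in for whatever whitelist sanitiser you use (for instance HTML Purifier's purify method). The sketch also assumes the course name is plain text, so it falls under the first case and is escaped with htmlspecialchars.

    ```php
    <?php
    // Hypothetical view helper, not part of the question's code.
    // $sanitizeHtml is assumed to be a callable that takes untrusted HTML
    // and returns a whitelisted, safe version of it.
    function renderCourse(string $name, string $homepage, callable $sanitizeHtml): string
    {
        // Case 1: the name is text, so escape it.
        $safeName = htmlspecialchars($name, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

        // Case 3: the homepage is untrusted HTML, so sanitise it instead of escaping it.
        $safeHomepage = $sanitizeHtml($homepage);

        return "<h1>{$safeName}</h1>\n<div class=\"homepage\">{$safeHomepage}</div>";
    }

    // Possible usage, assuming $purifier is a configured HTMLPurifier instance:
    // echo renderCourse($course->name, $course->homepage, [$purifier, 'purify']);
    ```

    Passing the sanitiser in as a callable keeps the two cases visibly separate: text is always escaped, and HTML is only ever output after it has gone through the sanitiser.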