
When and how to encode/decode HTML when interacting with a database?


I have a website that runs on PHP with a MySQL database. I was wondering how best to treat user input with regard to HTML encoding (I am well aware that I should store it as received and encode it on output: that's what I do), and this cycle in particular:

  • a user registers by filling in a form with a username field; the content of the field is validated, sent, and stored in the DB as is (no HTML encoding), since it will need to be output as HTML, XML, JSON, plaintext and other formats;
  • on any page requiring the username to be shown, it will be fetched from the database, HTML-encoded and displayed in the page;
  • on a particular page the username is placed in the "value" attribute of an HTML text input: obviously this means that the username must be HTML-encoded (otherwise XSS and all those fantastic things...). However, this also means that if the original username was "però" the text field will be <input value="per&ograve;"> and when the user submits it the server will receive per&ograve; instead of però.

Now my question is: should the server decode all the received inputs so that per&ograve; gets decoded to the original però?
My doubt is that this would mean that if a user inputs &egrave; as their username, it will be registered as è and not as they actually intended...

I know this is not such a big problem (I don't know of many users who would want to use HTML character entity literals in their usernames...), but it puzzled me and I could not find a completely satisfying solution.


Solution

  • Unless I've misunderstood what you're asking, you seem to have the wrong impression about the effect of outputting HTML encoded strings into text inputs. Here's a basic example of what will happen. Let's say you have a user who wants to be named PB&amp;J. Sure, it's weird, but not everyone can pick a nice non-weird username like "Bonvi" or "Don't Panic".

    So you save that in your database as is.

    Later, when you're using it in another form, you escape it for output.

    <input type="text" name="username" value="<?= htmlspecialchars($username) ?>">
    

    In your page source, you'll see

    <input type="text" name="username" value="PB&amp;amp;J">
    

    with the ampersand converted to an HTML entity. (Which is what you want, in case they really wanted to be named bob"><script>alert("però!")</script><p class="ha or something worse.)
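To make the escaping point concrete, here is a minimal sketch of what htmlspecialchars does to a hostile value like the one just mentioned. The username is a made-up example, and passing ENT_QUOTES and 'UTF-8' explicitly is my own choice for illustration, not something stated in the question.

```php
<?php
// Hypothetical hostile username, similar to the one mentioned above.
$username = 'bob"><script>alert("hi")</script>';

// ENT_QUOTES escapes both double and single quotes; the charset is given explicitly.
$escaped = htmlspecialchars($username, ENT_QUOTES, 'UTF-8');

// The quotes and angle brackets are now entities, so the value cannot
// close the attribute or open a new tag.
echo '<input type="text" name="username" value="' . $escaped . '">', PHP_EOL;
```

The emitted attribute value contains no raw double quote or angle bracket, so the browser treats the whole thing as text inside the input.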

    But the value displayed in the text box will be PB&amp;J, and when the user submits the form, the value in $_POST['username'] will be PB&amp;J, not PB&amp;amp;J. It will not be changed to the encoded value.
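The round trip can be simulated in PHP alone. In this sketch html_entity_decode is only a stand-in for the browser, on the assumption that parsing the attribute value decodes character references exactly once; the variable names are illustrative.

```php
<?php
// The username exactly as stored in the database.
$stored = 'PB&amp;J';

// What is written into the page source.
$inSource = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');

// Stand-in for the browser (an assumption for illustration): reading the
// attribute value decodes character references once.
$submitted = html_entity_decode($inSource, ENT_QUOTES, 'UTF-8');

echo $inSource, PHP_EOL;   // the escaped form seen in the page source
echo $submitted, PHP_EOL;  // what would arrive in $_POST['username']
var_dump($submitted === $stored);
```

Because the browser undoes exactly the one layer of encoding that was applied for output, the server needs no decoding step of its own.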

    (I used htmlspecialchars in this example, but the same would apply with your example using però with htmlentities.)
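Here is the same idea applied to the però case from the question, plus a sketch of why decoding received input on the server would do harm. It assumes the script file is saved as UTF-8, and html_entity_decode again only stands in for the browser's parsing of the attribute.

```php
<?php
$username = 'però';

// htmlentities turns the accented letter into a named entity for output.
$encoded = htmlentities($username, ENT_QUOTES, 'UTF-8');

// Browser stand-in: the text box shows, and submits, the decoded value.
$submitted = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
var_dump($submitted === $username);

// A user who literally types "&egrave;" submits those eight characters.
// Decoding received input on the server would silently change them.
$typed = '&egrave;';
$decodedOnServer = html_entity_decode($typed, ENT_QUOTES, 'UTF-8');
var_dump($decodedOnServer === $typed);
```

The first comparison should hold and the second should not, which is the argument for storing submitted values untouched and encoding only at output time.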

    I'm trying to explain it basically, so I apologize if I did misunderstand you - I don't intend to sound condescending.