
Confused with html encoding


I am getting confused with character encoding.

I understand people do things differently, but many suggest you should store your input in the database as it is entered, then deal with it when you are reading it in accordance with what you are planning to do with it. This makes sense to me.

So, if a user enters an apostrophe, double quote, ampersand, less-than or greater-than sign, these are written to my database as ' " & < > respectively.

Now, when reading the data with PHP, I am running the text through HTML Purifier to catch any injection issues.

Should I also htmlencode it? If I don't, it all appears OK (in Chrome and Firefox) but I am not sure if this is correct and will it display properly in other browsers?

If I use htmlentities with ENT_QUOTES, and htmlspecialchars, I start getting the codes coming through for these characters, which I believe is what I should see if looking at the page source, but not on the page the user sees.

The problem is, without doing the encoding, I am seeing what I want to see, but have this niggle in my mind, that I am not doing it correctly!


Solution

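  • You have this confused. Character encoding (UTF-8, for example) is not the same thing as HTML escaping with htmlspecialchars or htmlentities. Character encoding is an attribute of YOUR systems. Your websites and your database are responsible for character encoding.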

    You have to decide what you will accept. I would say in general, the web has moved towards standardization on UTF-8. So if your websites that accept user input AND your database, and all connections involved are UTF-8, then you are in a position to accept input as UTF-8, and your character set and collation in the database should be configured appropriately.
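    As a rough sketch of what "UTF-8 everywhere" can look like on the PHP side: the two helpers below are hypothetical names, not part of any library. One builds a PDO MySQL DSN that requests utf8mb4 (MySQL's full UTF-8 character set) on the connection, the other rejects input that is not well-formed UTF-8. The host and database names are placeholders.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that asks for utf8mb4 on the connection.
function utf8_mysql_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Hypothetical helper: true when $input is well-formed UTF-8.
// The /u modifier makes PCRE validate the subject as UTF-8;
// on malformed input preg_match() returns false instead of 1.
function is_valid_utf8(string $input): bool
{
    return preg_match('//u', $input) === 1;
}

$dsn = utf8_mysql_dsn('localhost', 'app'); // placeholder host and database names
// $pdo = new PDO($dsn, $user, $password); // credentials left out of this sketch

var_dump(is_valid_utf8("caf\u{00E9}")); // "café" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // a lone Latin-1 byte, not valid UTF-8
```

    If the mbstring extension is available, mb_check_encoding($input, 'UTF-8') does the same validity check.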

    At this point all your web pages should be HTML5, so the recommended HEAD section of your pages should at a minimum be this:

    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="utf-8">
    <title>Page title</title>
    </head>

    Next you have SQL injection. You specified PHP. If you are using mysqli or PDO (which in my experience is the better choice) AND you pass every variable as a bound parameter of a prepared statement (bindParam or bindValue in PDO, bind_param in mysqli), SQL injection through those values stops being an issue. The need for escaping input goes away too, because the values are sent separately from the SQL text and can no longer be mistaken for part of the statement. Table and column names cannot be bound, so never build those from user input.
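    A minimal sketch of bound parameters with PDO, assuming the pdo_sqlite extension is available. It uses an in-memory SQLite database only so that it runs on its own; the table and the function names are made up for the example. The hostile-looking string is stored and read back byte for byte, with no escaping on the way in.

```php
<?php
// Hypothetical setup: an in-memory SQLite table keeps the sketch self-contained.
function create_comment_store(): PDO
{
    $pdo = new PDO('sqlite::memory:');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec('CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT NOT NULL)');
    return $pdo;
}

// The value is bound to the :body placeholder, never concatenated into the SQL.
function save_comment(PDO $pdo, string $body): int
{
    $stmt = $pdo->prepare('INSERT INTO comments (body) VALUES (:body)');
    $stmt->bindValue(':body', $body, PDO::PARAM_STR);
    $stmt->execute();
    return (int) $pdo->lastInsertId();
}

// Returns the stored body, or null when no row has that id.
function load_comment(PDO $pdo, int $id): ?string
{
    $stmt = $pdo->prepare('SELECT body FROM comments WHERE id = :id');
    $stmt->bindValue(':id', $id, PDO::PARAM_INT);
    $stmt->execute();
    $body = $stmt->fetchColumn();
    return $body === false ? null : (string) $body;
}

$pdo = create_comment_store();
$hostile = "x'); DROP TABLE comments; -- <b>\"Tom & Jerry\"</b>";
$id = save_comment($pdo, $hostile);

// The table still exists and the text comes back unchanged.
var_dump(load_comment($pdo, $id) === $hostile);
```

    Against MySQL only the DSN and credentials would change; the prepare, bindValue and execute calls stay the same.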

    Finally, you mentioned HTML Purifier. That exists so that people can try to avoid XSS and other exploits of that nature, which occur when you accept user input and people inject HTML and JavaScript.

    That is always going to be a concern, depending on the nature of the system and what you do with that output, but as others suggested in comments, you can run sanitizers and filters on the output after you've retrieved it from the database. Sitting inside a php string variable there is no intrinsic danger, until you weaponize it by injecting it into a live html page you are serving.
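    For text that should be shown as plain text (no markup allowed), the usual output-time step is htmlspecialchars. The sketch below uses a hypothetical wrapper name, escape_html, and also shows why entity codes can end up visible on the rendered page: the text was escaped twice. Output that has already been turned into HTML by a sanitizer such as HTML Purifier should not be run through htmlspecialchars again for the same reason.

```php
<?php
// Hypothetical wrapper: escape plain text for an HTML text node or a quoted attribute value.
function escape_html(string $raw): string
{
    return htmlspecialchars($raw, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$stored = 'Tom & "Jerry" <b>it\'s</b>'; // raw value, exactly as it sits in the database

echo '<p>', escape_html($stored), '</p>', PHP_EOL;
// <p>Tom &amp; &quot;Jerry&quot; &lt;b&gt;it&#039;s&lt;/b&gt;</p>

// Escaping twice is what makes the entity codes show up on the page itself.
echo escape_html(escape_html('&')), PHP_EOL;
// &amp;amp;
```

    In the page source you should see the entities once; the browser turns them back into ' " & < > for the reader.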

    In terms of finding bad actors and people trying to mess with your system, you are obviously much better off having stored the original input as submitted. Then as you come to understand the nature of these exploits, you can search through your database looking for specific things, which you won't be able to do if you sanitize first and store the result.
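    As an illustration of that kind of after-the-fact search, here is a small sketch that scans raw stored values for a few well-known XSS markers. The pattern list and the function name are assumptions made for the example; they are nowhere near a complete signature set and will produce false positives.

```php
<?php
// Illustrative patterns only (an assumption of this sketch), not a complete XSS signature list.
const SUSPICIOUS_PATTERNS = [
    'script tag'      => '/<\s*script\b/i',
    'inline handler'  => '/\bon\w+\s*=/i',
    'javascript: URL' => '/javascript\s*:/i',
];

// Hypothetical helper: returns the labels of every pattern that matches the raw input.
function flag_suspicious(string $raw): array
{
    $hits = [];
    foreach (SUSPICIOUS_PATTERNS as $label => $pattern) {
        if (preg_match($pattern, $raw) === 1) {
            $hits[] = $label;
        }
    }
    return $hits;
}

// $rows stands in for raw values fetched from the database, keyed by row id.
$rows = [
    1 => 'Nice article, thanks!',
    2 => '<img src=x onerror=alert(1)>',
    3 => '<script>document.location="http://example.invalid"</script>',
];

foreach ($rows as $id => $raw) {
    $hits = flag_suspicious($raw);
    if ($hits !== []) {
        echo $id, ': ', implode(', ', $hits), PHP_EOL;
    }
}
```

    This only works because the rows hold the original input; once a value has been sanitized before storage, the markers it is looking for are gone.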