php, html-entities, html-encode

PHP not encoding em dash (among other things) correctly


I have a small JSON object that I'd like to send to PHP to put in a MySQL database. Part of the information in the string is HTML entities. The em dash (&mdash;) is giving me problems: it is showing up as â€. There are also problems with é displaying as Ã©.

I seem to be having some encoding problems. Any idea what could be wrong? Thanks


Solution

  • Because the data is coming from JSON, it should be encoded in a Unicode character set, the default being UTF-8 [Sources: Douglas Crockford, RFC 4627].

    This means that in order to store a non-ASCII character in your database, you will either need to convert the encoding of the incoming data to the character set of your database, or (preferably) use a Unicode character set for your database. The most common Unicode character set, and the one I'd recommend you use for this purpose, is UTF-8.
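
    To make the failure mode concrete, here is a minimal sketch. The sample JSON and variable names are made up for illustration, and it assumes the mbstring extension is loaded. It decodes a JSON string, reproduces the garbling you get when UTF-8 bytes are misread as Windows-1252, and shows the conversion you would need if the database had to stay on a legacy single-byte character set:

```php
<?php
// Hypothetical sample payload; \u00e9 is "é" and \u2014 is an em dash.
$json = '{"title": "Caf\u00e9 \u2014 menu"}';

// json_decode() returns strings encoded as UTF-8.
$data = json_decode($json, true);
$utf8 = $data['title'];

// What goes wrong: the UTF-8 bytes are treated as Windows-1252 text,
// so each multi-byte character turns into two or three wrong characters.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

// Fallback option: convert to the legacy character set before storing.
// Characters that do not exist in the target set would be lost.
$legacy = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');

printf("UTF-8 bytes:  %d\n", strlen($utf8));
printf("Legacy bytes: %d\n", strlen($legacy));
printf("Garbled:      %s\n", $garbled);
```

    The garbled string is the same kind of output described in the question, which suggests the data itself is fine and only its interpretation is wrong somewhere between PHP, the database, and the browser.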

    It is likely that your database is set up with one of the Latin character sets (ISO-8859-*), in which case you will most likely simply need to change the character set used for your table. This won't break any of your existing data, assuming that you currently have no records that use any characters outside the lower 128. Based on your comments above, you should be able to make this change using phpMyAdmin. You will need to ensure that you explicitly change each existing column you wish to alter; changing the character set of a table or database only affects new columns or tables that are created without specifying a character set.
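
    As a rough sketch of that per-column change outside phpMyAdmin: the helper below is hypothetical, as are the table and column names, and targeting utf8mb4 (MySQL's full UTF-8 character set) is an assumption on my part. It only builds the statement; actually running it is left commented out, and $type and $attributes are assumed to come from trusted code, not user input.

```php
<?php
// Hypothetical helper: builds an ALTER TABLE statement that changes the
// character set of a single existing column.
// $type is the column's data type, e.g. "VARCHAR(255)"; $attributes holds
// anything that must be restated, e.g. "NOT NULL", because MODIFY
// replaces the whole column definition.
function buildColumnCharsetSql(
    string $table,
    string $column,
    string $type,
    string $attributes = ''
): string {
    // Escape backticks so the identifiers cannot break out of their quotes.
    $quote = function (string $name): string {
        return '`' . str_replace('`', '``', $name) . '`';
    };

    $sql = sprintf(
        'ALTER TABLE %s MODIFY %s %s CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci',
        $quote($table),
        $quote($column),
        $type
    );

    return $attributes === '' ? $sql : $sql . ' ' . $attributes;
}

echo buildColumnCharsetSql('articles', 'title', 'VARCHAR(255)', 'NOT NULL'), "\n";

// Running it (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', 'user', 'secret');
// $pdo->exec(buildColumnCharsetSql('articles', 'title', 'VARCHAR(255)', 'NOT NULL'));
```

    The charset=utf8mb4 part of the DSN asks the connection itself to use UTF-8, so the bytes PHP sends are interpreted the same way the column stores them. Take a backup before altering columns that already hold data.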

    When you are outputting data to the client, you will also need to tell it that you are outputting UTF-8 so it knows how to display the characters correctly. You do this by ensuring you append ; charset=utf-8 to the Content-Type: header you send along with text-based content.

    For example, at the top of a PHP script that produces HTML that is encoded with UTF-8, you would add this line:

    header('Content-Type: text/html; charset=utf-8');
    

    It is also recommended that you declare the character set of the document within the document itself. This declaration must appear before any non-ASCII characters in the document, so it is recommended that you place the following <meta> tag as the first child of the <head>:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    
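
    Putting the header and the <meta> declaration together, a minimal page might look like the sketch below. The renderPage() function and its sample text are invented for illustration; htmlspecialchars() is given 'UTF-8' explicitly so that escaping agrees with the declared character set.

```php
<?php
// Hypothetical helper that returns a complete UTF-8 HTML document.
function renderPage(string $title, string $body): string
{
    // Escape the text for HTML, treating it as UTF-8.
    $escape = function (string $text): string {
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    };

    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        // The declaration comes first so it precedes any non-ASCII text.
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n"
        . '<title>' . $escape($title) . "</title>\n"
        . "</head>\n<body>\n"
        . '<p>' . $escape($body) . "</p>\n"
        . "</body>\n</html>\n";
}

// Send the HTTP header before any output, then the document itself.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
echo renderPage("Caf\u{E9} menu", "Soup \u{2014} bread & butter");
```

    Note that the non-ASCII characters are sent as-is rather than as entities; only the characters that are special in HTML, such as &, are escaped.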

    If you are producing XHTML with an XML declaration at the top, the character set may be declared there, instead of using a <meta> tag:

    <?xml version="1.0" encoding="UTF-8" ?>
    

    Remember, the use of a character set definition in the Content-Type: header is not limited to text/html - it makes sense in the context of any text/* family MIME type.
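
    One way to apply this consistently is a small helper like the sketch below. The function name contentTypeHeader() is hypothetical, and limiting the charset parameter to text/* types is simply the rule of thumb from the paragraph above, not a complete treatment of every MIME type.

```php
<?php
// Hypothetical helper: builds a Content-Type header line, adding the
// charset parameter only for text/* MIME types.
function contentTypeHeader(string $mimeType, string $charset = 'utf-8'): string
{
    if (strncasecmp($mimeType, 'text/', 5) === 0) {
        return "Content-Type: {$mimeType}; charset={$charset}";
    }
    return "Content-Type: {$mimeType}";
}

echo contentTypeHeader('text/css'), "\n";
echo contentTypeHeader('text/plain'), "\n";
echo contentTypeHeader('image/png'), "\n";
```

    In a script that serves a stylesheet, for instance, you would call header(contentTypeHeader('text/css')); before sending any output.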

    Further reading: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

    Also, make sure you validate your markup.