I am using HTML Purifier in my PHP project and am having trouble getting it to work properly with user input.
I am having users enter HTML using a WYSIWYG editor (TinyMCE), but whenever a user enters the HTML entity &amp;nbsp; (non-breaking space), it gets saved into the database as this weird foreign character (Â).
However, when I edit the saved entry using the WYSIWYG editor, it gets displayed properly as &amp;nbsp;. It also renders properly when displayed, except that in the source code it appears as a real space rather than the non-breaking space character.
Also, in the MySQL database it displays as the weird foreign character.
I read the documentation about Unicode and HTML Purifier and changed my database and web page encoding to UTF-8, but I am still having problems with the non-breaking space character being mangled. The other HTML entities, such as &lt; and &gt;, get saved as &amp;lt; and &amp;gt;, so why not &amp;nbsp;?
The non-breaking space isn't being saved in your database as one weird foreign character; it's being saved as two characters. The Unicode non-breaking space character (U+00A0) is encoded in UTF-8 as the two bytes 0xC2 0xA0, which, when read as ISO-8859-1, look like "Â " (i.e. a weird foreign character followed by a non-breaking space).
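To see the byte-level effect for yourself, here is a minimal sketch in Python (chosen only because it makes the bytes easy to inspect; it is not part of the PHP/MySQL setup). The variable names are made up for illustration, and it assumes the misinterpreting charset is plain ISO-8859-1. MySQL's latin1 is really closer to Windows-1252, but the two agree for these particular bytes.

```python
# Illustration only: how a UTF-8 non-breaking space turns into "Â" + NBSP
# when its bytes are misread as ISO-8859-1 and then stored as UTF-8.

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# 1. What the application actually sends: the UTF-8 encoding of U+00A0.
utf8_bytes = NBSP.encode("utf-8")
print(utf8_bytes.hex(" "))  # c2 a0

# 2. What a receiver sees if it assumes those bytes are ISO-8859-1:
#    two separate characters, U+00C2 (Â) and U+00A0 (NBSP).
misread = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00C2', 'U+00A0']

# 3. Storing those two characters in a UTF-8 column doubles the encoding.
stored = misread.encode("utf-8")
print(stored.hex(" "))  # c3 82 c2 a0

# 4. Reading back through the same wrong assumption undoes the damage,
#    which is why the value can still look fine in the editor.
round_trip = stored.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(round_trip == NBSP)  # True
```

Step 4 is why the entry looks correct when loaded back into the editor, while the database itself shows the stray Â.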
You're probably forgetting to run SET NAMES 'utf8' on your database connection. Without it, MySQL treats the bytes PHP sends as ISO-8859-1 (latin1, traditionally the default connection character set), even though they are really UTF-8.
Have a look at "UTF-8 all the way through…" to see how to properly set up UTF-8 when using PHP and MySQL.