I have UTF-8 text in a string (let's call it the "plain text") and I need to inject that text inside HTML code.
I'm using htmlspecialchars to convert special characters (that may occur in the plain text) to HTML entities.
This is a common problem; however, the resulting string is the HTML source of emails, so I'm wondering whether specific measures should be taken in the conversion process.
I know there are some differences and inconsistencies in the way email clients render HTML.
Also, a rule of thumb I've often read is "write your HTML like you're in 2001".
Is htmlspecialchars good for the conversion task?
Also, which flags should I set?
Normally I use:
$html = htmlspecialchars( $text, ENT_QUOTES, 'UTF-8' );
Should I use ENT_QUOTES | ENT_HTML401?
In short, it depends on whether you want to send a UTF-8 email or an ASCII email.
UTF-8 Email - just htmlspecialchars is fine:
// We're telling it that $text is UTF-8 (+see below about control chars).
// ENT_QUOTES is combined in because passing ENT_DISALLOWED on its own
// replaces the default flags and leaves quotes unescaped.
$html = htmlspecialchars( $text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8' );
This will swap out <, >, ", ' and & for you. Anything else, like é, will pass straight through unchanged (which is fine, as the email itself is UTF-8 too).
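As a quick illustration of the UTF-8 case, here is a minimal sketch. The sample string is made up, and the ENT_QUOTES | ENT_DISALLOWED combination is an assumption about which flags you want rather than a required setting:

```php
<?php
// Made-up sample input: markup characters plus a non-ASCII letter.
$text = 'Café <b>"menu"</b> & more';

// Escape only the characters that are significant to HTML;
// é is left alone because the email itself is sent as UTF-8.
$html = htmlspecialchars($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

echo $html, PHP_EOL;
// prints: Café &lt;b&gt;&quot;menu&quot;&lt;/b&gt; &amp; more
```

The result is only safe to send if the email really declares UTF-8 in its Content-Type header.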
ASCII Email - you'll need to do an HTML 4.01 entity swap out (which is the default), but with the same ENT_DISALLOWED flag:
// Same again - see below about the flags (ENT_QUOTES keeps quotes escaped):
$html = htmlentities( $text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8' );
This will swap out as many characters as possible for entities, so that things like é are represented in ASCII (as &eacute;).
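A small sketch of the ASCII case, again with a made-up sample string. The regular expression check at the end is only there to show that the output no longer contains any byte outside the ASCII range:

```php
<?php
// Made-up sample input containing a character with an HTML 4.01 named entity.
$text = 'Café & crème';

// htmlentities() converts every character that has a named entity
// in the HTML 4.01 set, not just the markup-significant ones.
$html = htmlentities($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

echo $html, PHP_EOL;
// prints: Caf&eacute; &amp; cr&egrave;me

// True when every byte of the result is plain ASCII.
$isAscii = preg_match('/^[\x00-\x7F]*$/', $html) === 1;
var_dump($isAscii);
```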
This part depends entirely on your audience and the kinds of email clients you're expecting to interact with. A brief tour of history should help you decide!
Up until roughly 2006, the vast majority of the web was ASCII. Named character entities, such as &eacute;, existed to let web pages represent a much broader range of Unicode code points, as well as to display characters which are significant to HTML itself. Here's the first issue: support for UTF-8 emails can be patchy.
If you're going for broad coverage with older clients then sending an ASCII email is a safer bet. That means you'll need to convert all of the unicode code points which are out of range of ASCII into an ASCII compatible representation (html entities). Fundamentally this is targeting older clients so using ENT_HTML5 - the greatly expanded entities set - makes no sense here.
However here's the other issue - the older HTML 4.01 entity set represents far fewer unicode codepoints, so if you're expecting to send text in a broad range of languages then you'll most likely need to send a UTF-8 email instead.
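To make the limitation concrete: Cyrillic letters have no named entity in HTML 4.01, so htmlentities leaves them as raw UTF-8 bytes. If you still need an ASCII-only email, one possible workaround (not part of the advice above, and it assumes the mbstring extension is available) is to turn whatever is left over into numeric character references. The helper name to_ascii_html is hypothetical:

```php
<?php
// Hypothetical helper, shown only as a sketch. Requires ext-mbstring.
function to_ascii_html(string $text): string
{
    // First pass: named HTML 4.01 entities where they exist.
    $escaped = htmlentities($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

    // Second pass: any code point still above ASCII becomes a decimal
    // numeric reference such as &#1055;. Map is [start, end, offset, mask].
    return mb_encode_numericentity($escaped, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

$text = 'Привет, café';

// Cyrillic passes through untouched; only é gets a named entity.
$named = htmlentities($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

// Everything outside ASCII is now expressed as an entity or reference.
$numeric = to_ascii_html($text);

echo $named, PHP_EOL;
echo $numeric, PHP_EOL;
```

Whether older clients render numeric references for these scripts correctly still depends on the fonts they have available, so this is worth testing against the clients you care about.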
UTF-8 vs. ASCII email self-test questions:
It's important to note that control characters - particularly the null byte - won't be handled by either htmlentities or htmlspecialchars by default. The null byte is also notorious for crashing things on the web, including, somewhat famously, Chrome with a short URL containing one. I'm unsure how many email clients correctly handle the null byte, but I'm inclined to think it's not many of them. So, the ENT_DISALLOWED flag will swap them out for a safer character (the Unicode replacement character, U+FFFD) for you.
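The difference is easy to check for yourself. This sketch uses a made-up string with an embedded null byte and compares the output with and without ENT_DISALLOWED:

```php
<?php
// Made-up input with an embedded null byte.
$text = "bad\0byte";

// Without ENT_DISALLOWED the null byte is passed through as-is.
$plain = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// With ENT_DISALLOWED, code points that are invalid for the document
// type are replaced with U+FFFD (the Unicode replacement character).
$safe = htmlspecialchars($text, ENT_QUOTES | ENT_DISALLOWED, 'UTF-8');

$plainHasNull = strpos($plain, "\0") !== false;
$safeHasNull  = strpos($safe, "\0") !== false;

var_dump($plainHasNull, $safeHasNull);
```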