I'm generating an XML document with DOMDocument in PHP. It contains some news items, each with a title, date, links and a description. The problem occurs in the description of some news items but not in others, even though both kinds contain accents and cedillas.
I create the DOM document as UTF-8:
$dom = new \DOMDocument("1.0", "UTF-8");
Then I retrieve my text from a MySQL database, which is encoded in latin-1. When I test the encoding with mb_detect_encoding, it returns UTF-8.
I tried the following:
iconv('UTF-8', 'ISO-8859-1', $descricao);
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $descricao);
iconv('ISO-8859-1', 'UTF-8', $descricao);
iconv('ISO-8859-1//TRANSLIT', 'UTF-8', $descricao);
mb_convert_encoding($descricao, 'ISO-8859-1', 'UTF-8');
mb_convert_encoding($descricao, 'UTF-8', 'ISO-8859-1');
mb_convert_encoding($descricao, 'UTF-8', 'UTF-8'); //that makes no sense, but who knows
I also tried changing the database encoding to UTF-8, and changing the XML encoding to ISO-8859-1.
This is the full method that generates the XML:
$informativos = Informativo::where('inf_ativo','S')->orderBy('inf_data','DESC')->take(20)->get();
$dom = new \DOMDocument("1.0", "UTF-8");
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$rss = $dom->createElement("rss");
$channel = $dom->createElement("channel");
$title = $dom->createElement("title", "Informativos");
$link = $dom->createElement("link", "http://example.com/informativos");
$channel->appendChild($title);
$channel->appendChild($link);
foreach ($informativos as $informativo) {
$item = $dom->createElement("item");
$itemTitle = $dom->createElement("title", $informativo->inf_titulo);
$itemImage = $dom->createElement("image", "http://example.com/".$informativo->inf_ilustracao);
$itemLink = $dom->createElement("link", "http://example.com/informativo/".$informativo->informativo_id);
$descricao = strip_tags($informativo->inf_descricao);
$descricao = str_replace("&nbsp;", " ", $descricao);
$descricao = str_replace("&#160;", " ", $descricao);
$descricao = substr($descricao, 0, 150).'...';
$itemDesc = $dom->createElement("description", $descricao);
$itemDate = $dom->createElement("pubDate", $informativo->inf_data);
$item->appendChild($itemTitle);
$item->appendChild($itemImage);
$item->appendChild($itemLink);
$item->appendChild($itemDesc);
$item->appendChild($itemDate);
$channel->appendChild($item);
}
$rss->appendChild($channel);
$dom->appendChild($rss);
return $dom->saveXML();
Here is an example of successful output:
Segundo a instituição, número de pessoas que vivem na pobreza subiu 7,3 milhões desde 2014, atingindo 21% da população, ou 43,5 milhões de br
And an example that gives the encoding error:
procuradores da Lava Jato em Curitiba, que estão sendo investigados por um
suposto acordo fraudulento com a Petrobras e o Departamento de Justi�...
Everything renders fine until the problematic description text above, which gives me:
"This page contains the following errors: error on line 118 at column 20: Encoding error Below is a rendering of the page up to the first error."
Probably that
is the problem here. Since I can't control whether the text contains it, I need to render these special characters correctly.
UPDATE 2019-04-12: Found the error in the problematic text and changed the example.
The encoding of the database connection is important. Make sure that it is set to UTF-8. It is a good idea to use UTF-8 most of the time (for your fields as well). Character sets like ISO-8859-1 contain only a very limited number of characters, so a Unicode string that gets encoded into them might lose data.
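To illustrate the data loss, here is a minimal sketch. The sample string is made up, and the mbstring extension is assumed to be available. It also shows that cutting a string by bytes can leave an invalid UTF-8 sequence behind, which is something an XML parser will reject.

```php
<?php
// Illustrative UTF-8 text: "ç" exists in ISO-8859-1, the euro sign does not.
$original = "Pre\u{E7}o: 10 \u{20AC}";

// Encode to ISO-8859-1 and back again.
$latin1    = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// The euro sign could not be represented, so the round trip is lossy.
var_dump($roundTrip === $original);

// A byte-based cut through a multi-byte character yields invalid UTF-8.
$broken = substr("Justi\u{E7}a", 0, 6);
var_dump(mb_check_encoding($broken, 'UTF-8'));
```

Both calls should report false: the first because a character was lost, the second because the last byte is only half of "ç".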
The second argument of DOMDocument::createElement() is broken. It only escapes some special characters, but not &. To avoid problems, create and append the content as a separate text node. DOMNode::appendChild() returns the appended node, so the DOMDocument::create* calls can be nested and chained.
$data = [
[
'inf_titulo' => 'Foo',
'inf_ilustracao' => 'foo.jpg',
'informativo_id' => 42,
'inf_descricao' => 'Some content',
'inf_data' => 'a-date'
]
];
$informativos = json_decode(json_encode($data));
function stripTagsAndTruncate($text) {
$text = strip_tags($text);
$text = str_replace(["&nbsp;", "&#160;"], " ", $text);
return substr($text, 0, 150).'...';
}
$dom = new DOMDocument('1.0', 'UTF-8');
$rss = $dom->appendChild($dom->createElement('rss'));
$channel = $rss->appendChild($dom->createElement("channel"));
$channel
    ->appendChild($dom->createElement("title"))
    ->appendChild($dom->createTextNode("Informativos"));
$channel
    ->appendChild($dom->createElement("link"))
    ->appendChild($dom->createTextNode("http://example.com/informativos"));
foreach ($informativos as $informativo) {
    $item = $channel->appendChild($dom->createElement("item"));
    $item
        ->appendChild($dom->createElement("title"))
        ->appendChild($dom->createTextNode($informativo->inf_titulo));
    $item
        ->appendChild($dom->createElement("image"))
        ->appendChild($dom->createTextNode("http://example.com/".$informativo->inf_ilustracao));
    $item
        ->appendChild($dom->createElement("link"))
        ->appendChild($dom->createTextNode("http://example.com/informativo/".$informativo->informativo_id));
    $item
        ->appendChild($dom->createElement("description"))
        ->appendChild($dom->createTextNode(stripTagsAndTruncate($informativo->inf_descricao)));
    $item
        ->appendChild($dom->createElement("pubDate"))
        ->appendChild($dom->createTextNode($informativo->inf_data));
}
$dom->formatOutput = TRUE;
echo $dom->saveXML();
Output:
<?xml version="1.0" encoding="UTF-8"?>
<rss>
  <channel>
    <title>Informativos</title>
    <link>http://example.com/informativos</link>
    <item>
      <title>Foo</title>
      <image>http://example.com/foo.jpg</image>
      <link>http://example.com/informativo/42</link>
      <description>Some content...</description>
      <pubDate>a-date</pubDate>
    </item>
  </channel>
</rss>
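The repeated create-and-append pattern can be wrapped in a small function. This is only a sketch: appendTextElement() is a hypothetical helper name, not part of the DOM extension. The sample title contains & and < to show that the serializer escapes them once the content is a text node.

```php
<?php
/**
 * Hypothetical helper: create an element named $name below $parent and
 * store $text in it as a text node, so special characters get escaped
 * when the document is serialized.
 */
function appendTextElement(DOMNode $parent, string $name, string $text): DOMElement
{
    // A DOMDocument is its own factory, any other node knows its owner.
    $document = $parent instanceof DOMDocument ? $parent : $parent->ownerDocument;
    $element = $document->createElement($name);
    $element->appendChild($document->createTextNode($text));
    $parent->appendChild($element);
    return $element;
}

$dom = new DOMDocument('1.0', 'UTF-8');
$item = $dom->appendChild($dom->createElement('item'));
appendTextElement($item, 'title', 'Petrobras & Lava Jato <b>');

// Serialize only the item node to keep the output short.
$xml = $dom->saveXML($item);
echo $xml, "\n";
// <item><title>Petrobras &amp; Lava Jato &lt;b&gt;</title></item>
```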
Truncating an HTML fragment can result in broken entities and broken code points (if you don't use a UTF-8 aware string function). Here are two approaches to solve it.
You can use PCRE in UTF-8 mode and match n entities/codepoints:
// have some string with HTML and entities
$text = 'Hello<b>&auml;öü</b>&nbsp;ä foobar';
// strip tags and replace some specific entities with spaces
$stripped = str_replace(['&nbsp;', '&#160;'], ' ', strip_tags($text));
// match 0-10 entities or unicode codepoints
preg_match('(^(?:&[^;]+;|\\X){0,10})u', $stripped, $match);
var_dump($match[0]);
Output:
string(18) "Hello&auml;öü ä"
However, I would suggest using DOM. It can load HTML and lets you run XPath expressions on it.
// have some string with HTML and entities
$text = 'Hello<b>&auml;öü</b>&nbsp;ä foobar';
$document = new DOMDocument();
// force UTF-8 and load
$document->loadHTML('<?xml encoding="UTF-8"?>'.$text);
$xpath = new DOMXpath($document);
// use xpath to fetch the first 10 characters of the text content
var_dump($xpath->evaluate('substring(//body, 1, 10)'));
Output:
string(15) "Helloäöü ä"
DOM in general treats all strings as UTF-8, so code points are not a problem. XPath's substring() works on the text content of the first matched node. Its arguments are character positions (not indexes), so they start at 1.
DOMDocument::loadHTML() will add html and body tags and decode entities. The result will be a little cleaner than with the PCRE approach.
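As a further sketch, separate from the two approaches above: if the mbstring extension is available, the entities can be decoded first and the string cut by characters instead of bytes. plainTextExcerpt() is a hypothetical function name, and the assumption is that the input is a UTF-8 HTML fragment.

```php
<?php
/**
 * Hypothetical helper: reduce an HTML fragment to plain text and return
 * at most $length characters of it without splitting a code point.
 */
function plainTextExcerpt(string $html, int $length = 150): string
{
    $text = strip_tags($html);
    // Turn entities such as &auml; or &nbsp; into real UTF-8 characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace non-breaking spaces (U+00A0) with regular spaces.
    $text = str_replace("\u{A0}", ' ', $text);
    // Cut by characters, not by bytes.
    return mb_substr($text, 0, $length, 'UTF-8');
}

// Same sample as above, written with escapes so the file encoding does not matter.
$excerpt = plainTextExcerpt("Hello<b>&auml;\u{F6}\u{FC}</b>&nbsp;\u{E4} foobar", 10);
var_dump($excerpt);
var_dump(mb_check_encoding($excerpt, 'UTF-8'));
```

Because the entities are decoded before truncation, the result contains neither a half-cut entity nor a half-cut multi-byte character, and it can be passed to createTextNode() as is.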