php xml utf-8 disqus

XML read error because of bad UTF8 encoding


I'm trying to create a script to export my comments to Disqus and, in order to do that, I need to make a huge XML file.

I have a problem with UTF-8 encoding. The file is supposed to be in UTF-8, but I need to call utf8_decode for my Spanish characters to display properly.

The generated file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:dsq="http://www.disqus.com/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.0/"
>
<channel>
    <wp:comment>
        <wp:comment_id>26</wp:comment_id>
        <wp:comment_author>KA_DIE</wp:comment_author>
        <wp:comment_author_email> </wp:comment_author_email>
        <wp:comment_author_url></wp:comment_author_url>
        <wp:comment_author_IP> </wp:comment_author_IP>
        <wp:comment_date_gmt>2009-07-16 18:53:19</wp:comment_date_gmt>
        <wp:comment_content><![CDATA[WTF TEH Gladios en español <br />tnx tnx <br />me usta mucho esa web estoy pendiente mucho se su actualziacion es buen saber ke esta en español <br />x que solo entendia el 80, 90% de la paguina jiji]]></wp:comment_content>
        <wp:comment_approved>1</wp:comment_approved>
        <wp:comment_parent>0</wp:comment_parent>
    </wp:comment>
</channel>
</rss>

I removed some data, such as the IP and email, for security reasons. As you can see, the content contains the letter "ñ". But the XML shown throws an error:

XML read error: bad composed

I don't know the exact translation of the message, but it fails on the content line. The XML is generated with this code:

public function generateXmlElement (){
            $xml = "<wp:comment>
                        <wp:comment_id>$this->id</wp:comment_id>
                        <wp:comment_author>$this->author</wp:comment_author>
                        <wp:comment_author_email>$this->author_email</wp:comment_author_email>
                        <wp:comment_author_url>$this->author_url</wp:comment_author_url>
                        <wp:comment_author_IP>$this->author_ip</wp:comment_author_IP>
                        <wp:comment_date_gmt>$this->date</wp:comment_date_gmt>
                        <wp:comment_content><![CDATA[$this->content]]></wp:comment_content>
                        <wp:comment_approved>$this->approved</wp:comment_approved>
                        <wp:comment_parent>0</wp:comment_parent>
            </wp:comment>";
            return $xml;
        }

The result is then written to a file with fwrite.

Do you know what the problem could be?


Solution

  • You should use a proper XML library to generate XML instead of concatenating strings. libxml2 comes bundled with PHP and is accessible through PHP's DOM API. It handles encoding and escaping issues, among other things. As is usually the case with such things, it is an upfront learning investment whose benefit is not immediately clear, but the benefit is real.
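
As a rough illustration of the DOM approach, here is a minimal sketch that builds the same structure as the file in the question. The function names (toUtf8, buildComment, buildExport), the array keys, and the ISO-8859-1 fallback are assumptions made for this example, not part of the original script. PHP's DOM methods expect UTF-8 strings, so the sketch converts any value that is not valid UTF-8 before handing it to DOM; it assumes the mbstring extension is available. Only the wp namespace is declared; the others from the question could be added the same way.

```php
<?php
// Sketch only: function names, array keys and the ISO-8859-1 fallback
// are assumptions for illustration.

const WP_NS = 'http://wordpress.org/export/1.0/';

// DOM expects UTF-8 input. Strings that are not valid UTF-8 are converted,
// assuming (hypothetically) that they are ISO-8859-1. Requires ext-mbstring.
function toUtf8(string $s): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Build one <wp:comment> element from an associative array.
function buildComment(DOMDocument $doc, array $c): DOMElement
{
    $comment = $doc->createElementNS(WP_NS, 'wp:comment');
    $fields = [
        'comment_id'           => $c['id'],
        'comment_author'       => $c['author'],
        'comment_author_email' => $c['author_email'],
        'comment_author_url'   => $c['author_url'],
        'comment_author_IP'    => $c['author_ip'],
        'comment_date_gmt'     => $c['date'],
        'comment_content'      => $c['content'],
        'comment_approved'     => $c['approved'],
        'comment_parent'       => 0,
    ];
    foreach ($fields as $name => $value) {
        $el = $doc->createElementNS(WP_NS, 'wp:' . $name);
        $text = toUtf8((string) $value);
        // The comment body goes into a CDATA section, as in the question;
        // other values become text nodes, which are escaped on output.
        $node = $name === 'comment_content'
            ? $doc->createCDATASection($text)
            : $doc->createTextNode($text);
        $el->appendChild($node);
        $comment->appendChild($el);
    }
    return $comment;
}

// Build the whole export document and return it as a UTF-8 XML string.
function buildExport(array $comments): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    // Declare the wp prefix on the root element.
    $rss->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:wp', WP_NS);
    $doc->appendChild($rss);

    $channel = $rss->appendChild($doc->createElement('channel'));
    foreach ($comments as $c) {
        $channel->appendChild(buildComment($doc, $c));
    }
    return $doc->saveXML();
}

// Usage: "\xF1" is "ñ" in ISO-8859-1, i.e. a byte that is invalid in UTF-8.
$xml = buildExport([[
    'id' => 26, 'author' => 'KA_DIE', 'author_email' => '',
    'author_url' => '', 'author_ip' => '', 'date' => '2009-07-16 18:53:19',
    'content' => "WTF TEH Gladios en espa\xF1ol <br />tnx tnx",
    'approved' => 1,
]]);
echo $xml;
```

The returned string can be written out with fwrite or file_put_contents. If the source data is already UTF-8, toUtf8 leaves it untouched, and utf8_decode should not be applied anywhere before the values reach the XML.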