Android - SaxParser error: ParseException: At line 1, column 0: not well-formed (invalid token)


I'm getting the following exception when trying to parse some XML:

org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 0: not well-formed (invalid token)

This only happens on Android 2.2 and 2.3 devices. The strangest part is that the first time I parse the response it works, but every subsequent attempt throws the parsing exception.

My code is as follows:

        // Feed to parse
        URL url = new URL("http://m.ideasmusik.com/rss/?ct=mx");
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();

        // Parse content
        MusicRSSParser parser = new MusicHandler.MusicRSSParser(); // extends DefaultHandler
        XMLReader xr = sp.getXMLReader();
        xr.setContentHandler(parser);
        InputSource in = new InputSource(url.openStream());
        in.setEncoding(HTTP.UTF_8); // org.apache.http.protocol.HTTP
        xr.parse(in);

The XML is UTF-8 (I've read that an incorrect encoding is a common cause of this error).

Any idea what is going wrong? I thought it could be something in my handler, but it fails before my logic runs, right after the startDocument() method.

I have tried passing the URL instead of an InputStream, with the same result.

EDIT

If I go to Application Management and clear the app cache, it works again, but only the first time. How can that affect the parsing?


Solution

  • Got it!

    The problem is in the RSS feed itself: it is not well-formed.

    Not every browser shows it (when they render the XML with syntax coloring they hide the problem), but the raw source begins like this:

    <?xml version="1.0" encoding="UTF-8"?>
          <rss version="2.0">
              <channel>
                   <title>Top Canciones</title>
                   <link>m.ideasmusik.com/rss/?ct=mx&</link> ...
    

    XML does not allow a bare & character; it has to be escaped as &amp;.

    All the other special characters in the document were escaped, but I think they missed this one because it is inside the link tag rather than in the main content.

    Somehow the SAX parser ignores it on the first run.

    What I did (until the RSS is fixed) was to read the response as a string and remove that & manually before parsing the XML. I know it is a horrible solution, but it is the quickest and easiest one for the moment.
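
A rough sketch of that kind of workaround is below. Note that it is a variation, not the exact fix described above: instead of deleting the stray &, it escapes any & that does not already start an entity or character reference. The class name AmpersandFix, the helper escapeBareAmpersands and the sample feed string are made up for illustration, and the sketch assumes the response has already been read into a String.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AmpersandFix {
    // Matches an '&' that is NOT the start of a named entity (&amp;),
    // a decimal reference (&#38;) or a hex reference (&#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escapes bare ampersands, leaves valid references alone.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) throws Exception {
        // Sample input modeled on the start of the feed shown above.
        String raw = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss version=\"2.0\"><channel><title>Top Canciones</title>"
                + "<link>m.ideasmusik.com/rss/?ct=mx&</link></channel></rss>";

        String fixed = escapeBareAmpersands(raw);

        // The cleaned string is parsed from a Reader; a real handler
        // would replace the empty DefaultHandler used here.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(fixed)), new DefaultHandler());

        System.out.println(fixed);
    }
}
```

In the code from the question, the same idea would mean reading url.openStream() into a String as UTF-8, passing it through the helper, and handing xr.parse() an InputSource built from a StringReader. A regular expression like this is only a stopgap; it cannot repair other kinds of malformed markup.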