
Invalid UTF-8 start byte 0x8b (at char #2, byte #-1)


I am trying to parse the Atom document at the URL 'http://self-learning-java-tutorial.blogspot.in/atom.xml'. While parsing it, I get the error 'Invalid UTF-8 start byte 0x8b (at char #2, byte #-1)'.

        Abdera abdera = new Abdera();
        Parser parser = abdera.getParser();

        URL url = new URL("http://self-learning-java-tutorial.blogspot.in/atom.xml");

        Document<Feed> doc = parser.parse(url.openStream(), url.toString());
        Feed feed = doc.getRoot();
        System.out.println(feed.getTitle());
        for (Entry entry : feed.getEntries()) {
            System.out.println("\t" + entry.getTitle());
        }
        System.out.println(feed.getAuthor());

Can anyone help me understand what this error means and how to resolve it?


Solution

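  • The response from this URL is GZIP-compressed. A gzip stream always begins with the magic bytes 0x1f 0x8b, so the parser fails on the second byte (0x8b), which is not a valid UTF-8 start byte. Something in your environment must be requesting gzip, because standard Java 8 does not send an "Accept-Encoding: gzip" header by default, and your code works fine for me as is.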

    To handle this, decompress the stream before parsing it. Note that for other URLs you may need to handle responses that arrive uncompressed. Also, remember to close the resources and streams that you open.

    Here is a working sample for your URL:

        Abdera abdera = new Abdera();
        Parser parser = abdera.getParser();

        URL url = new URL(
                "http://self-learning-java-tutorial.blogspot.in/atom.xml");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Explicitly advertise gzip support so the behaviour is reproducible.
        conn.setRequestProperty("Accept-Encoding", "gzip");
        conn.connect();

        try {
            // Only wrap the stream when the server says the body is gzip-encoded.
            String contentEncoding = conn.getContentEncoding();
            boolean isGzip = contentEncoding != null
                    && contentEncoding.contains("gzip");
            // try-with-resources closes the (possibly wrapped) stream.
            try (InputStream in = !isGzip ? conn.getInputStream()
                    : new GZIPInputStream(conn.getInputStream())) {
                Document<Feed> doc = parser.parse(in, url.toString());
                Feed feed = doc.getRoot();
                System.out.println(feed.getTitle());
                for (Entry entry : feed.getEntries()) {
                    System.out.println("\t" + entry.getTitle());
                }
                System.out.println(feed.getAuthor());
            }
        } finally {
            conn.disconnect();
        }
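
If you cannot rely on the Content-Encoding header (for example, when a proxy strips it), one alternative is to look at the first two bytes of the body and wrap the stream only when they match the gzip magic number 0x1f 0x8b. The following is a minimal sketch of that idea using only the JDK. The class name GzipSniffer and the helper names maybeDecompress, gzip and readAll are hypothetical and not part of Abdera. The demo uses in-memory bytes instead of a network call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: detects gzip by its magic bytes rather than by HTTP headers.
public class GzipSniffer {

    /** Returns a stream of plain bytes whether or not the input is gzip-compressed. */
    public static InputStream maybeDecompress(InputStream raw) throws IOException {
        // A 2-byte pushback buffer lets us peek at the header and then "un-read" it.
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] header = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = in.read(header, n, 2 - n);
            if (r < 0) {
                break; // stream shorter than two bytes
            }
            n += r;
        }
        if (n > 0) {
            in.unread(header, 0, n); // put the peeked bytes back
        }
        boolean isGzip = n == 2
                && (header[0] & 0xff) == 0x1f
                && (header[1] & 0xff) == 0x8b;
        return isGzip ? new GZIPInputStream(in) : in;
    }

    /** Compresses data with gzip (used here only to build demo input). */
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    /** Reads the whole stream as a UTF-8 string. */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int r;
        while ((r = in.read(buf)) != -1) {
            out.write(buf, 0, r);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<feed/>".getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(xml);
        // The second byte is the 0x8b from the error message.
        System.out.printf("first bytes: 0x%02x 0x%02x%n",
                packed[0] & 0xff, packed[1] & 0xff);
        // Both the compressed and the plain input yield the same text.
        System.out.println(readAll(maybeDecompress(new ByteArrayInputStream(packed))));
        System.out.println(readAll(maybeDecompress(new ByteArrayInputStream(xml))));
    }
}
```

With a helper like this, the parse call in the sample above could become parser.parse(GzipSniffer.maybeDecompress(conn.getInputStream()), url.toString()), which would handle compressed and uncompressed responses the same way.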