java html parsing jsoup ioexception

java.io.IOException: Mark has been invalidated when parsing website with jsoup


When I try to parse an HTML page from a website, my code crashes with this error:

java.io.IOException: Mark has been invalidated.

Part of my code:

String xml = xxxxxx;
try {
    Document document = Jsoup.connect(xml).maxBodySize(1024*1024*10)
            .timeout(0).ignoreContentType(true)
            .parser(Parser.xmlParser()).get();

    Elements elements = document.body().select("td.hotv_text:eq(0)");

    for (Element element : elements) {
        Element element1 = element.select("a[href].hotv_text").first();
        hashMap.put(element.text(), element1.attr("abs:href"));
    }
} catch (HttpStatusException ex) {
    Log.i("GyWueInetSvc", "Exception while JSoup connect:" + xml +" cause:"+ ex.getMessage());
} catch (IOException e) {
    e.printStackTrace();
    throw new RuntimeException("Socket timeout: " + e.getMessage(), e);
}

The page I want to parse is about 2 MB. When I debug the code, I see that execution reaches this method in jsoup's ConstrainableInputStream.java:

public void reset() throws IOException {
    super.reset();
    remaining = maxSize - markpos;
}

At that point markpos is -1, and the exception is thrown.
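
For context, the same kind of failure can be reproduced with a plain java.io.BufferedInputStream, without jsoup. The following is only a minimal sketch: the class name MarkResetDemo, the helper resetFails, and the buffer sizes are made up for illustration and are not taken from jsoup. It assumes an OpenJDK-style BufferedInputStream, where reading past the read limit given to mark() lets the stream drop the mark (markpos becomes -1), so a later reset() throws an IOException. The exact message text depends on the runtime.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

class MarkResetDemo {
    // Hypothetical helper: returns true if reset() throws after
    // bytesToRead bytes have been read past the mark.
    static boolean resetFails(int bufferSize, int readLimit, int bytesToRead) {
        byte[] data = new byte[64]; // dummy payload standing in for a page body
        try (BufferedInputStream in =
                 new BufferedInputStream(new ByteArrayInputStream(data), bufferSize)) {
            in.mark(readLimit);
            for (int i = 0; i < bytesToRead; i++) {
                in.read(); // single-byte reads go through the internal buffer
            }
            in.reset(); // throws IOException if the mark was invalidated
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Read limit 4, but 16 bytes are read: the mark is expected to be dropped.
        System.out.println(resetFails(8, 4, 16));
        // Read limit 32 covers the 16 bytes read: reset() is expected to succeed.
        System.out.println(resetFails(8, 32, 16));
    }
}
```

This suggests why a large page can trigger the error: once more data has been read than the mark allows for, there is nothing left to rewind to.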

How can I solve that problem?


Solution

  • I found a solution. The problem was that the buffer was being overloaded. I solved it by downloading the page myself and passing the content to Jsoup.parse, using the code below:

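    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.Scanner;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String content = "";
    try {
        URLConnection connection = new URL(xml).openConnection();
        StringBuilder sb = new StringBuilder();
        // Read the whole response line by line; the Scanner is closed automatically.
        try (Scanner scanner = new Scanner(connection.getInputStream())) {
            while (scanner.hasNextLine()) {
                // Keep the line break so text on adjacent lines is not fused together.
                sb.append(scanner.nextLine()).append('\n');
            }
        }
        content = sb.toString();
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Pass the URL as the base URI so that attr("abs:href") can resolve relative links.
    Document document = Jsoup.parse(content, xml);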
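
The "read everything first, then parse" step above could also be done without Scanner. This is a rough sketch under stated assumptions, not part of the original answer: the class name StreamText and the method readAll are hypothetical, and the charset is passed in explicitly because the page's real encoding is not known here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamText {
    // Hypothetical helper: copies the whole stream into memory and decodes it once.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n); // append only the bytes actually read
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream(); UTF-8 is an assumption.
        byte[] page = "<td class=\"hotv_text\"><a href=\"/x\">x</a></td>"
                .getBytes(StandardCharsets.UTF_8);
        String content = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(content);
    }
}
```

The resulting string can then be handed to Jsoup.parse(content, xml) in the same way as in the answer's code. Copying raw bytes keeps line breaks and whitespace exactly as the server sent them.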