
Is it possible to ignore the HTTP Content-Length header?


I am using Crawler4J to collect information about a website, but sometimes I get the following error:

INFORMATION: Exception while fetching content for: {someurl} [Premature end of Content-Length delimited message body (expected: X; received: Y)]

It is not clear to me whether this happens only when X &lt; Y, or in the opposite case as well.

The exception is thrown in fetcher.PageFetchResult.java, in fetchContent (I guess when the response headers are read).

So my question is: is there any way to (generally) ignore the HTTP Content-Length header and get the information anyway?

I already searched the crawler4j issue tracker, but found no similar problem there.

Maybe somebody in the Stack Overflow community has an idea how to solve this.

Thank you very much,

Hisushi

EDIT

Code (snippet) that throws this exception:

public boolean fetchContent(Page page) {
    try {
        page.load(entity);
        page.setFetchResponseHeaders(responseHeaders);
        return true;
    } catch (Exception e) {
        logger.log(Level.INFO, "Exception while fetching content for: " + page.getWebURL().getURL() + " [" + e.getMessage()
                + "]");
    }
    return false;
}

responseHeaders and entity are null by default:

protected HttpEntity entity = null;
protected Header[] responseHeaders = null;

Solution

  • Premature end of Content-Length delimited message body usually means the server closed the connection before you had read the number of bytes announced in the Content-Length header. The simplest fix is to include a retry mechanism in your code, so that a failed fetch is attempted again and the full body can be retrieved on a subsequent request.
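One way to add such a retry is to wrap the fetch in a small loop. The sketch below is a minimal, self-contained illustration, not crawler4j API: the `withRetry` helper, the attempt count, and the backoff delay are all assumptions, and the simulated fetch in `main` stands in for a real call such as `fetchContent`.

```java
import java.util.concurrent.Callable;

public class RetryingFetch {

    /**
     * Runs the given fetch operation up to maxAttempts times, sleeping
     * backoffMillis between attempts. Returns the result of the first
     * successful call, or rethrows the last exception if all attempts fail.
     */
    public static <T> T withRetry(Callable<T> fetch, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (Exception e) {
                // e.g. "Premature end of Content-Length delimited message body"
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated fetch that fails twice before succeeding,
        // mimicking a server that truncates the body intermittently.
        final int[] calls = {0};
        String body = withRetry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException(
                        "Premature end of Content-Length delimited message body");
            }
            return "full body";
        }, 5, 10);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

This keeps the crawler code unchanged and simply retries the whole fetch; that is usually preferable to disabling Content-Length checking, since a truncated body is genuinely incomplete data.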