
On WARC-Type of entries in StormCrawler WARC files


After upgrading our crawler from StormCrawler 1.8 to 1.14, we noticed that the record type of our WARC entries changed from "WARC-Type: response" to "WARC-Type: resource". Any suggestions on how to switch back to "WARC-Type: response"?


Solution

  • Nothing has changed in WARCRecordFormat between 1.8 and 1.14. If a verbatim HTTP response header is available, a response record is written; if there is no HTTP header, a WARC resource record is written instead.

    In order to store the HTTP headers, the following configuration is required:

    http.store.headers: true
    
    http.protocol.implementation: com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
    https.protocol.implementation: com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
    

    More information can be found in the README of the WARC module.
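To check whether a configuration change had the intended effect, it can help to count the record types in a generated WARC file. The following is a minimal sketch, not part of StormCrawler: the class `WarcTypeCounter` and its method `countTypes` are hypothetical names introduced here. It assumes an uncompressed WARC stream and simply looks for lines starting with `WARC-Type:`, so a payload that happens to contain such a line would be counted as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for inspecting crawler output; not part of StormCrawler.
class WarcTypeCounter {

    private static final String PREFIX = "WARC-Type:";

    /** Counts the values of WARC-Type header lines in an uncompressed WARC stream. */
    static Map<String, Integer> countTypes(InputStream in) {
        Map<String, Integer> counts = new TreeMap<>();
        // ISO-8859-1 maps every byte to a character, so binary payloads
        // cannot trigger decoding errors while scanning line by line.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Compare the header name case-insensitively.
                if (line.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                    String type = line.substring(PREFIX.length()).trim();
                    counts.merge(type, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

Calling `WarcTypeCounter.countTypes(new FileInputStream(path))` on an uncompressed file returns a map such as `{resource=..., response=...}`. For a gzip-compressed file, wrapping the stream in `java.util.zip.GZIPInputStream` first should work. If `response` is missing from the result after the configuration above has been applied, the HTTP headers are probably still not being stored.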