Search code examples
httpchecksum

HTTP mechanisms to ensure downloaded data has not been corrupted during download?


I have a requirement to make legal documents available to mobile applications (e.g. android, iphone, etc) via HTTP. Corruption can occur over http (references: 1, 2). In my case it is imperative that the downloaded documents have not been corrupt during transmission.

One mechanism for ensuring integrity is to digitally sign the documents. This approach works well if the documents are xml, however the signing public key will need to be available and trusted by the client.

Another mechanism is to create and store a checksum of the document (e.g. MD5). The client can download the document and the checksum, and then use the checksum to verify the document.

  • Question 1: Are there any other alternative mechanisms for ensuring the integrity?
  • Question 2: Does http have any built in mechanisms for ensuring downloaded data has not been corrupted during download?
  • Question 3: What is the statical likelihood of document corruption during download over HTTP (I would prefer this answer to be backed up by statistical data)?

Solution

  • As far as I know, HTTP itself does not have any built-in checksum mechanism and your suggestion would work for ensuring the data is valid. The thing is though, HTTP is generally implemented on the Transmission Control Protocol (TCP). TCP provides reliable communication between hosts.

    Specifically, TCP itself implements error detection (using a checksum) and uses special number sequences to ensure the data arrives in the order that it was sent. If the host sending the data receives information that the receiving host did not get the data, it will resend.

    If however the HTTP implementation on the device is actually running on top of the User Datagram Protocol (UDP), it isn't reliable however it is unlikely that a device is using UDP for HTTP or at least the unreliable version (as there is a Reliable User Datagram Protocol).

    Now, I couldn't find statistics or much information at all regarding corruption of a HTTP request. Depending how mission critical you deem this to be, treat it like it would happen then. There is mention of downloading files that end up being corrupt. While these mostly seem to relate to ZIP files, I wouldn't think it is due to HTTP but rather other things inbetween like the device itself that is downloading and corrupting the information.

    Perhaps in your scenario, it is best to add your checksum if it is absolutely critically important that your information arrives in one piece.