
Reading gzipped chunked data HTTP 1.1 in Java


I am trying to read the body of an HTTP request that is gzip-compressed and sent with chunked transfer encoding. The code I am using:

byte[] d; // *whole* request body

ByteArrayOutputStream b = new ByteArrayOutputStream();

int c = 0;
int p = 0;

int s = 0;

for(int i = 0; i < d.length; ++i) {
    if (s == 0 && d[i] == '\r' && d[i + 1] == '\n') {
        c = Integer.parseInt(new String(Arrays.copyOfRange(d, p+1, i)), 16);

        if(c == 0) break;

        b.write(Arrays.copyOfRange(d, i+2, i+2+c));

        p = i + 1;
        i += c + 1;

        s = 1;
    } else if (s == 1 && d[i] == '\r' && d[i + 1] == '\n') {
        p = i + 1;
        s = 0;
    }
}

// here comes the part where I decompress  b.toByteArray()

In short, the program reads the chunk size, writes that many bytes of the request (starting just after the '\n' that ends the size line) to the ByteArrayOutputStream b, and repeats the process until a chunk of size 0 is found.

When I try to decompress the result, I always get a corrupted-data error, e.g. java.util.zip.ZipException: invalid distance too far back.

Any thoughts on what I might be doing wrong?


Solution

  • Obligatory preamble: in a professional context, I'd always use a library for this. Apache HttpComponents, for example, handles chunked decoding (and much more) for you. If you don't want a library and don't mind the risk, there is sun.net.www.http.ChunkedInputStream in the JRE, but it is an internal sun.* class rather than a supported API.

    Also, in a professional context, descriptive variable names would be preferred :)

    Anyway, I saw a mistake: p should be initialized to -1, not 0.

    That seems to be all, because with that fix I can decode the following (courtesy of Wikipedia):

    4\r\n
    Wiki\r\n
    5\r\n
    pedia\r\n
    E\r\n
     in\r\n
    \r\n
    chunks.\r\n
    0\r\n
    \r\n
    

    into this:

    Wikipedia in
    
    chunks.
    

    (Yes, that is the expected output; see the Wikipedia page.)

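    The size line is parsed from d[p+1] up to (but not including) the '\r'. If you initialize p to 0, the first parse starts at index 1, so it skips the first hex digit of the first chunk size. For a one-digit size such as the 4 above, that leaves an empty string and Integer.parseInt throws a NumberFormatException; for a multi-digit size, the leading digit is silently dropped, the wrong number of bytes is copied, and everything after that is out of step.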

    I realize my example is not gzipped, but the error is in the code that reads the size of the first chunk, so that should not matter. With some luck, it is the only mistake.
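
For reference, here is a minimal sketch of the same approach with descriptive names, where the parser tracks the start of each size line directly instead of a "previous newline" index, so the off-by-one cannot occur. The class and method names (ChunkedBodyDecoder, decodeChunked, indexOfCrlf, gunzip) are made up for this illustration. It assumes the whole body is already in memory, ignores trailers, and is not a substitute for a real HTTP library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; all names here are illustrative.
final class ChunkedBodyDecoder {

    /** Concatenates the chunk payloads of a complete chunked body; trailers are ignored. */
    static byte[] decodeChunked(byte[] body) {
        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        int pos = 0; // index of the first byte of the current size line
        while (true) {
            int lineEnd = indexOfCrlf(body, pos);
            if (lineEnd < 0) {
                throw new IllegalArgumentException("unterminated chunk size line at " + pos);
            }
            String sizeLine = new String(body, pos, lineEnd - pos, StandardCharsets.US_ASCII);
            int semicolon = sizeLine.indexOf(';'); // optional chunk extensions follow ';'
            if (semicolon >= 0) {
                sizeLine = sizeLine.substring(0, semicolon);
            }
            // Throws NumberFormatException if the size is not valid hex.
            int chunkSize = Integer.parseInt(sizeLine.trim(), 16);
            if (chunkSize == 0) {
                return payload.toByteArray(); // the zero-size chunk ends the body
            }
            int dataStart = lineEnd + 2;
            // The chunk data plus its trailing CRLF must fit in the buffer.
            if (chunkSize < 0 || chunkSize > body.length - dataStart - 2) {
                throw new IllegalArgumentException("truncated chunk at " + dataStart);
            }
            payload.write(body, dataStart, chunkSize);
            pos = dataStart + chunkSize + 2; // skip the data and the CRLF after it
        }
    }

    /** Returns the index of the next "\r\n" at or after from, or -1 if there is none. */
    static int indexOfCrlf(byte[] data, int from) {
        for (int i = from; i + 1 < data.length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n') {
                return i;
            }
        }
        return -1;
    }

    /** Decompresses gzip data held in memory. */
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The Wikipedia example from above, as it appears on the wire.
        String wire = "4\r\nWiki\r\n5\r\npedia\r\nE\r\n in\r\n\r\nchunks.\r\n0\r\n\r\n";
        byte[] decoded = decodeChunked(wire.getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
    }
}
```

For a gzipped request, the decompression step would then be gunzip(decodeChunked(d)), with d being the whole request body as in the question. Note that the chunks must be joined before decompressing: the gzip stream spans all chunks, so decompressing them one by one will not work.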