java http gzip http-compression content-encoding

Does a GunzipOutputStream - or something like it - exist?

Related to Handling HTTP ContentEncoding "deflate", I'd like to know how to use an OutputStream to inflate both gzip and deflate streams. Here's why:

I have a class that fetches resources from a web server (think wget, but in Java). It strictly enforces the Content-Length of the response, and I'd like to keep that enforcement. So, what I'd like to do is read a specific number of bytes from the response (which I'm already doing) but have it generate more bytes if the response has been compressed.

I have this working for deflate responses like this:

OutputStream out = System.out;
out = new InflaterOutputStream(out); // java.util.zip.InflaterOutputStream
// repeatedly:
out.write(compressedBytesFromResponse);
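
For reference, here is a minimal self-contained sketch of the deflate case, assuming the stream used above corresponds to java.util.zip.InflaterOutputStream. The class name InflateOnOutputDemo, the helper names and the chunk size are made up for illustration, and a ByteArrayOutputStream stands in for the real destination:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterOutputStream;

// Hypothetical demo class; not part of the original code.
class InflateOnOutputDemo {

    // Produce zlib-wrapped deflate data to play the role of the response body.
    static byte[] deflate(byte[] plain) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(sink)) {
            deflater.write(plain);
        }
        return sink.toByteArray();
    }

    // Write compressed bytes in small chunks, as a network read loop would,
    // and let InflaterOutputStream decompress them on the way out.
    // chunkSize is assumed to be positive.
    static byte[] inflateInChunks(byte[] compressed, int chunkSize) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new InflaterOutputStream(sink)) {
            for (int off = 0; off < compressed.length; off += chunkSize) {
                int len = Math.min(chunkSize, compressed.length - off);
                out.write(compressed, off, len);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] plain = "hello, hello, hello".getBytes(StandardCharsets.UTF_8);
        byte[] roundTripped = inflateInChunks(deflate(plain), 4);
        System.out.println(new String(roundTripped, StandardCharsets.UTF_8));
    }
}
```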

I'd like to be able to do the same thing with gzip responses, but without a GunzipOutputStream, I'm not sure what to do next.

Update

I was considering building something like this, but it seemed completely insane. Perhaps that is the only way to use an OutputStream to inflate my data.

Solution

Answering my own question:

There are two possibilities here: gunzip on output (i.e. use a GunzipOutputStream, which the Java API does not provide), or gunzip on input (i.e. use GZIPInputStream, which the Java API does provide) and enforce the Content-Length during the reads.

I have done both, and I think I prefer the latter because a) it does not require a separate thread to be launched to pump bytes from a PipedOutputStream to a PipedInputStream and b) (a corollary, I guess) it does not carry the same risk of race conditions and other synchronization issues.

First, here is my implementation of LimitedInputStream, which allows me to wrap the input stream and enforce a limit on the amount of data read. Note that I also have a BigLimitedInputStream that uses a BigInteger count to support Content-Length values greater than Long.MAX_VALUE:

import java.io.IOException;
import java.io.InputStream;

public class LimitedInputStream
    extends InputStream
{
    private long _limit;
    private long _read;
    private InputStream _in;

    public LimitedInputStream(InputStream in, long limit)
    {
        _limit= limit;
        _in = in;
        _read = 0;
    }
    @Override
    public int available()
        throws IOException
    {
        // Never report more than the limit still allows
        return (int)Math.max(0L, Math.min((long)_in.available(), _limit - _read));
    }

    @Override
    public void close()
        throws IOException
    {
        _in.close();
    }

    @Override
    public boolean markSupported()
    {
        return false;
    }

    @Override
    public int read()
        throws IOException
    {
        // Check the limit first so that no byte beyond it is ever consumed
        if(_read >= _limit)
            return -1;
            // throw new IOException("Read limit reached: " + _limit);

        int read = _in.read();

        if(-1 == read)
            return -1;

        ++_read;

        return read;
    }

    @Override
    public int read(byte[] b)
        throws IOException
    {
        return read(b, 0, b.length);
    }

    @Override
    public int read(byte[] b, int off, int len)
        throws IOException
    {
        // 'len' is an int, so 'max' is an int; narrowing cast is safe
        int max = (int)Math.min((long)(_limit - _read), (long)len);

        if(0 == max && len > 0)
            return -1;
            //throw new IOException("Read limit reached: " + _limit);

        int read = _in.read(b, off, max);

        // Do not count the end-of-stream marker (-1) as data read
        if(read > 0)
            _read += read;

        // 'max' never exceeds the remaining allowance, so the limit cannot be overrun here
        return read;
    }

    @Override
    public long skip(long n)
        throws IOException
    {
        long max = Math.min((long)(_limit - _read), n);

        if(0 == max)
            return 0;

        long read = _in.skip(max);

        _read += read;

        return read;
    }
}

Wrapping the InputStream obtained from the HttpURLConnection in the above class lets me simplify my existing code: instead of counting out the precise number of bytes named in the Content-Length header, I can just blindly copy input to output. I then wrap the input stream (already wrapped in the LimitedInputStream) in a GZIPInputStream to decompress, and just pump the bytes from the (doubly-wrapped) input to the output.
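
The double wrapping described above can be sketched roughly as follows. This is an illustration under assumptions, not my production code: Bounded is a hypothetical, cut-down stand-in for LimitedInputStream so that the example compiles on its own, and a ByteArrayInputStream with 16 stray trailing bytes stands in for the connection's stream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of the original code.
class GunzipOnInputDemo {

    // Cut-down stand-in for LimitedInputStream: reports end-of-stream at the limit.
    static final class Bounded extends FilterInputStream {
        private long remaining;

        Bounded(InputStream in, long limit) { super(in); remaining = limit; }

        @Override public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = in.read();
            if (b >= 0) remaining--;
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            if (remaining <= 0) return -1;
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }

        @Override public long skip(long n) throws IOException {
            long skipped = in.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override public int available() throws IOException {
            return (int) Math.min(in.available(), remaining);
        }
    }

    // Build a gzip body to play the role of the compressed response.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(plain);
        }
        return sink.toByteArray();
    }

    // contentLength counts the compressed bytes on the wire, so the limit
    // goes on the inner stream and the decompressor on the outside.
    // Closing the wrapper also closes the raw stream.
    static byte[] readBody(InputStream raw, long contentLength) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new Bounded(raw, contentLength))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) >= 0) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("hello, hello, hello".getBytes(StandardCharsets.UTF_8));
        // 16 extra bytes that must not be read as part of this response
        byte[] wire = Arrays.copyOf(body, body.length + 16);
        byte[] plain = readBody(new ByteArrayInputStream(wire), body.length);
        System.out.println(new String(plain, StandardCharsets.UTF_8));
    }
}
```

The order of the wrappers matters: Content-Length describes the compressed bytes, so the limit has to sit underneath the GZIPInputStream rather than on top of it.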

The less straightforward route is to pursue my original line of thought: to wrap the OutputStream using (what turned out to be) an awkward class: GunzipOutputStream. I have written a GunzipOutputStream which uses an internal thread to pump bytes through a pair of piped streams. It's ugly, and it's based upon code from OpenRDF's GunzipOutputStream. I think mine is a bit simpler:

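import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.zip.GZIPInputStream;

public class GunzipOutputStream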
    extends OutputStream
{
    final private Thread _pump;

    // Streams
    final private PipedOutputStream _zipped;  // Compressed bytes are written here (by clients)
    final private PipedInputStream _pipe; // Compressed bytes are read (internally) here
    final private OutputStream _out; // Uncompressed data is written here (by the pump thread)

    // Internal state
    private volatile IOException _e; // set by the pump thread, read by the writing thread

    public GunzipOutputStream(OutputStream out)
        throws IOException
    {
        _zipped = new PipedOutputStream();
        _pipe = new PipedInputStream(_zipped);
        _out = out;
        _pump = new Thread(new Runnable() {
            public void run() {
                InputStream in = null;
                try
                {
                    in = new GZIPInputStream(_pipe);

                    pump(in, _out);
                }
                catch (IOException e)
                {
                    _e = e;
                    System.err.println(e);
                    _e.printStackTrace();
                }
                finally
                {
                    // 'in' is still null if the GZIPInputStream constructor threw
                    if(null != in)
                        try { in.close(); } catch (IOException ioe)
                        { ioe.printStackTrace(); }
                }
            }

            private void pump(InputStream in, OutputStream out)
                throws IOException
            {
                long count = 0;

                byte[] buf = new byte[4096];

                int read;
                while ((read = in.read(buf)) >= 0) {
                    System.err.println("===> Pumping " + read + " bytes");
                    out.write(buf, 0, read);
                    count += read;
                }
                out.flush();
                System.err.println("===> Pumped a total of " + count + " bytes");
            }
        }, "GunzipOutputStream stream pump " + GunzipOutputStream.this.hashCode());

        _pump.start();
    }

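    public void close() throws IOException {
        // Closing the write end signals end-of-stream to the pump thread
        _zipped.close();

        try
        {
            // Let the pump drain the pipe before the read end is closed
            _pump.join();
        }
        catch (InterruptedException ie)
        {
            Thread.currentThread().interrupt();
        }

        _pipe.close();
        _out.close();

        // Report any error the pump thread ran into
        throwIOException();
    }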

    public void flush() throws IOException {
        throwIOException();
        _zipped.flush();
    }

    public void write(int b) throws IOException {
        throwIOException();
        _zipped.write(b);
    }

    public void write(byte[] b) throws IOException {
        throwIOException();
        _zipped.write(b);
    }

    public void write(byte[] b, int off, int len) throws IOException {
        throwIOException();
        _zipped.write(b, off, len);
    }

    public String toString() {
        return _zipped.toString();
    }

    protected void finish()
        throws IOException
    {
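        // Close the write end first; otherwise the pump thread never sees
        // end-of-stream and join() would block forever
        _zipped.close();

        try
        {
            _pump.join();
        }
        catch (InterruptedException ie)
        {
            // Preserve the interrupt for the caller
            Thread.currentThread().interrupt();
        }

        _pipe.close();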
    }

    private void throwIOException()
        throws IOException
    {
        if(null != _e)
        {
            IOException e = _e;
            _e = null; // Clear the existing error
            throw e;
        }
    }
}

Again, this works, but it seems fairly ... fragile.

In the end, I refactored my code to use the LimitedInputStream and GZIPInputStream and didn't use the GunzipOutputStream. If the Java API provided a GunzipOutputStream, that would have been great. But it doesn't, and without writing a "native" gunzip algorithm, implementing your own GunzipOutputStream stretches the limits of propriety.
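
To give an idea of what the "native" route would involve, here is a rough, thread-free sketch built on java.util.zip.Inflater in raw (nowrap) mode. It is not the code I ended up using, and it rests on simplifying assumptions: the class name SimpleGunzipOutputStream and the helper gunzipInChunks are hypothetical, only the fixed 10-byte gzip header is handled (streams with optional header fields such as a file name are rejected), only a single gzip member is supported, and the CRC32/length trailer is ignored rather than verified:

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical class; not part of the Java API.
class SimpleGunzipOutputStream extends FilterOutputStream {
    private static final int FIXED_HEADER_LENGTH = 10;

    private final Inflater inflater = new Inflater(true); // true = raw deflate, no zlib wrapper
    private final byte[] buffer = new byte[4096];
    private int headerBytesSeen = 0;
    private boolean closed = false;

    SimpleGunzipOutputStream(OutputStream out) { super(out); }

    @Override public void write(int b) throws IOException {
        write(new byte[] { (byte) b }, 0, 1);
    }

    @Override public void write(byte[] b, int off, int len) throws IOException {
        // Consume and validate the fixed header, which may span several writes
        while (len > 0 && headerBytesSeen < FIXED_HEADER_LENGTH) {
            checkHeaderByte(headerBytesSeen++, b[off++] & 0xff);
            len--;
        }
        // Bytes after the end of the deflate data are the trailer; ignored here
        if (len == 0 || inflater.finished()) return;

        inflater.setInput(b, off, len);
        try {
            while (!inflater.finished() && !inflater.needsInput()) {
                int n = inflater.inflate(buffer);
                if (n > 0) {
                    out.write(buffer, 0, n);
                } else if (inflater.needsDictionary()) {
                    throw new IOException("Preset dictionaries are not supported");
                }
            }
        } catch (DataFormatException e) {
            throw new IOException("Corrupt gzip data", e);
        }
    }

    private static void checkHeaderByte(int index, int value) throws IOException {
        boolean ok;
        switch (index) {
            case 0: ok = value == 0x1f; break;          // ID1
            case 1: ok = value == 0x8b; break;          // ID2
            case 2: ok = value == 8; break;             // CM: deflate
            case 3: ok = (value & 0xfe) == 0; break;    // FLG: only FTEXT tolerated in this sketch
            default: ok = true;                         // MTIME, XFL, OS are ignored
        }
        if (!ok) throw new IOException("Unsupported gzip header byte at offset " + index);
    }

    @Override public void close() throws IOException {
        if (closed) return;
        closed = true;
        boolean complete = inflater.finished();
        inflater.end();
        super.close(); // flushes and closes the wrapped stream
        if (!complete) throw new IOException("gzip stream ended before the deflate data was complete");
    }

    // Helper for the demo: push gzip bytes through in chunks of chunkSize (> 0).
    static byte[] gunzipInChunks(byte[] zipped, int chunkSize) throws IOException {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (OutputStream out = new SimpleGunzipOutputStream(plain)) {
            for (int off = 0; off < zipped.length; off += chunkSize) {
                out.write(zipped, off, Math.min(chunkSize, zipped.length - off));
            }
        }
        return plain.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(zipped)) {
            gz.write("hello, hello, hello".getBytes(StandardCharsets.UTF_8));
        }
        byte[] plain = gunzipInChunks(zipped.toByteArray(), 3);
        System.out.println(new String(plain, StandardCharsets.UTF_8));
    }
}
```

Handling the optional header fields, concatenated members and trailer verification properly is where the real work would be, which is a large part of why the GZIPInputStream route still looks like the saner choice to me.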