Search code examples
haskellgzipcontent-encodinghttp-conduit

http-conduit: Transfer data using gzip


I intend to fetch a large amount of data over HTTP/HTTPS using http-conduit. In order to do this efficiently, I want to use the Accept-Encoding: deflate,gzip header to allow the server (if supported) to transfer the data in a compressed way.

However, some of the servers I want to fetch from seem to incorrectly respond with the Content-Encoding: gzip header while not returning gzip data.

Therefore I need to handle these cases:

  • Server does not support compression --> Return plain response body
  • Server returns gzipped/deflated content --> Return decompressed response body
  • Server says (in response headers it returns gzipped content, but gzip decoding fails --> Return plain response body

In the third case, it can (in this specific case) safely be assumed, that the plaintext, uncompressed data does not look like gzip data, so Response headers say it is gzipped && un-gzip fails is fully equivalent to The data is not compressed.

How can I do this using http-conduit?

Note: This question intentionally does not show research effort because it has been answered immediately in a Q&A-style way.


Solution

  • In order to make this answer more concise, we will use code & concepts from some of my previous posts:

    • simpleHttpWithManager from here
    • Tolerant gzip/deflate decoding from here

    To avoid redundancy, I will first explain the basic steps and then provide a full example.

    First, we shall handle sending the headers. Note that wile http-types contains hContentEncoding, hAcceptEncoding is not predefined. Besides that, this is a trivial task.

    After sending the request, we need to check if there is a Content-Encoding. If there is none, we shall assume uncompressed plaintext, else we need to check if it is gzip or deflate. Which one it is exactly does not matter in this context as [zlib] supports automatic detection by header.

    For this rather simple example we just assume that if the server returns a Content-Encoding value that is neither absent nor gzip nor deflate, the response is not compressed. As we did not allow (by Accept-Encoding) other compressions like sdch, the server would be violating the HTTP standard by acting that way.

    If we detect a compressed encoding, we try to decompress and return it. If this fails or if the data is not compressed at all, we return the plain response body.

    Here's the example:

    {-# LANGUAGE OverloadedStrings #-}
    import Network.HTTP.Conduit
    import Network.Connection
    import Codec.Compression.Zlib.Internal
    import Data.Maybe
    import Data.Either
    import Network.HTTP.Types
    import Data.ByteString.Char8 (ByteString)
    import qualified Data.ByteString.Lazy.Char8 as LB
    
    myurl :: String
    myurl = "http://stackoverflow.com"
    
    hAcceptEncoding :: HeaderName
    hAcceptEncoding = "Accept-Encoding"
    
    -- | The Accept-Encoding HTTP header value for allowing gzip or deflated responses
    gzipDeflateEncoding :: ByteString
    gzipDeflateEncoding = "gzip,deflate"
    
    -- HTTP header list that allows gzipped/deflated response
    compressionEnabledHeaders :: RequestHeaders
    compressionEnabledHeaders = [(hAcceptEncoding, gzipDeflateEncoding)]
    
    -- | Give an encoding string and a HTTP response object,
    --   Checks if the Content-Encoding header value of the response object
    --   is equal to the given encoding. Returns false if no ContentEncoding
    --   header exists in the given response, or if the value does not match
    --   the encoding parameter.
    hasResponseEncoding :: ByteString -> Response b -> Bool
    hasResponseEncoding encoding response =
        let responseEncoding = lookup hContentEncoding headers
            headers = responseHeaders response
        in maybe False (== encoding) responseEncoding
    
    -- | Convert the custom error format from zlib to a Either
    decompressStreamToEither :: DecompressStream -> Either String LB.ByteString
    decompressStreamToEither (StreamError _ errmsg) = Left errmsg
    decompressStreamToEither stream@(StreamChunk _ _) = Right $ fromDecompressStream stream
    decompressStreamToEither StreamEnd = Right $ ""
    
    -- | Decompress with explicit error handling
    safeDecompress :: LB.ByteString -> Either String LB.ByteString
    safeDecompress bstr = decompressStreamToEither $ decompressWithErrors gzipOrZlibFormat defaultDecompressParams bstr
    
    -- | Decompress gzip, if it fails, return uncompressed String
    decompressIfPossible :: LB.ByteString -> LB.ByteString
    decompressIfPossible bstr =
        let conv (Left a) = bstr
            conv (Right a) = a
        in (conv . safeDecompress) bstr
    
    -- | Tolerantly decompress response body. As some HTTP servers set the header incorrectly,
    --   just return the plain response text if the compression fails
    decompressResponseBody :: Response LB.ByteString -> LB.ByteString
    decompressResponseBody res
        | hasResponseEncoding "gzip" res = decompressIfPossible $ responseBody res
        | hasResponseEncoding "deflate" res = decompressIfPossible $ responseBody res
        | otherwise = responseBody res
    
    -- | Download like with simpleHttp, but using an existing manager for the task
    --   and automatically requesting & handling gzipped data
    simpleHttpWithAutoGzip :: Manager -> String -> IO LB.ByteString
    simpleHttpWithAutoGzip manager url = do req <- parseUrl url
                                            let req' = req {requestHeaders = compressionEnabledHeaders}
                                            fmap decompressResponseBody $ httpLbs req' manager
    
    -- Example usage
    main :: IO ()
    main = do manager <- newManager conduitManagerSettings -- Create a simple manager
              content <- simpleHttpWithAutoGzip manager "http://stackoverflow.com"
              -- Print the uncompressed content
              print $ content