I intend to fetch a large amount of data over HTTP/HTTPS using http-conduit
. In order to do this efficiently, I want to use the Accept-Encoding: deflate,gzip
header to allow the server (if supported) to transfer the data in a compressed way.
However, some of the servers I want to fetch from seem to incorrectly respond with the Content-Encoding: gzip
header while not returning gzip data.
Therefore I need to handle these cases:
In the third case, it can (in this specific case) safely be assumed, that the plaintext, uncompressed data does not look like gzip data, so Response headers say it is gzipped && un-gzip fails is fully equivalent to The data is not compressed.
How can I do this using http-conduit
?
Note: This question intentionally does not show research effort because it has been answered immediately in a Q&A-style way.
In order to make this answer more concise, we will use code & concepts from some of my previous posts:
To avoid redundancy, I will first explain the basic steps and then provide a full example.
First, we shall handle sending the headers. Note that wile http-types
contains hContentEncoding
, hAcceptEncoding
is not predefined. Besides that, this is a trivial task.
After sending the request, we need to check if there is a Content-Encoding
. If there is none, we shall assume uncompressed plaintext, else we need to check if it is gzip
or deflate
. Which one it is exactly does not matter in this context as [zlib
] supports automatic detection by header.
For this rather simple example we just assume that if the server returns a Content-Encoding
value that is neither absent nor gzip
nor deflate
, the response is not compressed. As we did not allow (by Accept-Encoding
) other compressions like sdch
, the server would be violating the HTTP standard by acting that way.
If we detect a compressed encoding, we try to decompress and return it. If this fails or if the data is not compressed at all, we return the plain response body.
Here's the example:
{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Conduit
import Network.Connection
import Codec.Compression.Zlib.Internal
import Data.Maybe
import Data.Either
import Network.HTTP.Types
import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Lazy.Char8 as LB
myurl :: String
myurl = "http://stackoverflow.com"
hAcceptEncoding :: HeaderName
hAcceptEncoding = "Accept-Encoding"
-- | The Accept-Encoding HTTP header value for allowing gzip or deflated responses
gzipDeflateEncoding :: ByteString
gzipDeflateEncoding = "gzip,deflate"
-- HTTP header list that allows gzipped/deflated response
compressionEnabledHeaders :: RequestHeaders
compressionEnabledHeaders = [(hAcceptEncoding, gzipDeflateEncoding)]
-- | Give an encoding string and a HTTP response object,
-- Checks if the Content-Encoding header value of the response object
-- is equal to the given encoding. Returns false if no ContentEncoding
-- header exists in the given response, or if the value does not match
-- the encoding parameter.
hasResponseEncoding :: ByteString -> Response b -> Bool
hasResponseEncoding encoding response =
let responseEncoding = lookup hContentEncoding headers
headers = responseHeaders response
in maybe False (== encoding) responseEncoding
-- | Convert the custom error format from zlib to a Either
decompressStreamToEither :: DecompressStream -> Either String LB.ByteString
decompressStreamToEither (StreamError _ errmsg) = Left errmsg
decompressStreamToEither stream@(StreamChunk _ _) = Right $ fromDecompressStream stream
decompressStreamToEither StreamEnd = Right $ ""
-- | Decompress with explicit error handling
safeDecompress :: LB.ByteString -> Either String LB.ByteString
safeDecompress bstr = decompressStreamToEither $ decompressWithErrors gzipOrZlibFormat defaultDecompressParams bstr
-- | Decompress gzip, if it fails, return uncompressed String
decompressIfPossible :: LB.ByteString -> LB.ByteString
decompressIfPossible bstr =
let conv (Left a) = bstr
conv (Right a) = a
in (conv . safeDecompress) bstr
-- | Tolerantly decompress response body. As some HTTP servers set the header incorrectly,
-- just return the plain response text if the compression fails
decompressResponseBody :: Response LB.ByteString -> LB.ByteString
decompressResponseBody res
| hasResponseEncoding "gzip" res = decompressIfPossible $ responseBody res
| hasResponseEncoding "deflate" res = decompressIfPossible $ responseBody res
| otherwise = responseBody res
-- | Download like with simpleHttp, but using an existing manager for the task
-- and automatically requesting & handling gzipped data
simpleHttpWithAutoGzip :: Manager -> String -> IO LB.ByteString
simpleHttpWithAutoGzip manager url = do req <- parseUrl url
let req' = req {requestHeaders = compressionEnabledHeaders}
fmap decompressResponseBody $ httpLbs req' manager
-- Example usage
main :: IO ()
main = do manager <- newManager conduitManagerSettings -- Create a simple manager
content <- simpleHttpWithAutoGzip manager "http://stackoverflow.com"
-- Print the uncompressed content
print $ content