
Problems with text/csv Content-Encoding = UTF-8 in Ruby Mechanize


When attempting to load a page that is a CSV served with a Content-Encoding of UTF-8, using Mechanize 2.5.1, I used the following code:

a.content_encoding_hooks << lambda{|httpagent, uri, response, body_io|
 response['Content-Encoding'] = 'none' if response['Content-Encoding'].to_s == 'UTF-8'
}
p4 = a.get(redirect_url, nil, ['accept-encoding' => 'UTF-8'])

but I find that the content encoding hook is not being called and I get the following error and traceback:

/Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:787:in 'response_content_encoding': unsupported content-encoding: UTF-8 (Mechanize::Error)
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:274:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:949:in 'response_redirect'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize/http/agent.rb:299:in 'fetch'
    from /Users/jackrg/.rbenv/versions/1.9.2-p290/lib/ruby/gems/1.9.1/gems/mechanize-2.5.1/lib/mechanize.rb:407:in 'get'
    from prototype/test1.rb:307:in `<main>'

Does anyone have an idea why the content hook code is not firing and why I am getting the error?


Solution

  • but I find that the content encoding hook is not being called

    What makes you think that?

    The error message references this code:

      def response_content_encoding response, body_io
        ...
        ...
    
        out_io = case response['Content-Encoding']
                 when nil, 'none', '7bit', "" then
                   body_io
                 when 'deflate' then
                   content_encoding_inflate body_io
                 when 'gzip', 'x-gzip' then
                   content_encoding_gunzip body_io
                 else
                   raise Mechanize::Error,
                     "unsupported content-encoding: #{response['Content-Encoding']}"
    

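    So Mechanize passes the body through untouched when the Content-Encoding is missing, 'none', '7bit', or an empty string, and otherwise only recognizes 'deflate', 'gzip', and 'x-gzip'. Any other value, including 'UTF-8', raises the error you are seeing.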
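    To make the branching concrete, here is a minimal standalone sketch that mirrors the quoted case statement. The method name classify_content_encoding is made up for illustration and is not part of Mechanize; it only reports which branch a given header value would take.

```ruby
# Hypothetical helper, not Mechanize API: mirrors the branches of the
# quoted case statement and names the one a header value would hit.
def classify_content_encoding(value)
  case value
  when nil, 'none', '7bit', '' then :passthrough  # body used as-is
  when 'deflate'               then :inflate      # zlib/deflate decoding
  when 'gzip', 'x-gzip'        then :gunzip       # gzip decoding
  else                              :unsupported  # Mechanize raises here
  end
end

[nil, 'none', 'gzip', 'deflate', 'UTF-8'].each do |value|
  puts "#{value.inspect} -> #{classify_content_encoding(value)}"
end
```

    Note that the comparison is an exact string match, so a charset name such as 'UTF-8' can only ever land in the else branch.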

    From the HTTP/1.1 spec:

    14.11 Content-Encoding

    The Content-Encoding entity-header field is used as a modifier to the media-type. When present, its value indicates what additional content codings have been applied to the entity-body, and thus what decoding mechanisms must be applied in order to obtain the media-type referenced by the Content-Type header field. Content-Encoding is primarily used to allow a document to be compressed without losing the identity of its underlying media type.

       Content-Encoding  = "Content-Encoding" ":" 1#content-coding
    

    Content codings are defined in section 3.5. An example of its use is

       Content-Encoding: gzip
    

    The content-coding is a characteristic of the entity identified by the Request-URI. Typically, the entity-body is stored with this encoding and is only decoded before rendering or analogous usage. However, a non-transparent proxy MAY modify the content-coding if the new coding is known to be acceptable to the recipient, unless the "no-transform" cache-control directive is present in the message.

    ... ...

    3.5 Content Codings

    Content coding values indicate an encoding transformation that has been or can be applied to an entity. Content codings are primarily used to allow a document to be compressed or otherwise usefully transformed without losing the identity of its underlying media type and without loss of information. Frequently, the entity is stored in coded form, transmitted directly, and only decoded by the recipient.

       content-coding   = token
    

    All content-coding values are case-insensitive. HTTP/1.1 uses content-coding values in the Accept-Encoding (section 14.3) and Content-Encoding (section 14.11) header fields. Although the value describes the content-coding, what is more important is that it indicates what decoding mechanism will be required to remove the encoding.

    The Internet Assigned Numbers Authority (IANA) acts as a registry for content-coding value tokens. Initially, the registry contains the following tokens:

    gzip An encoding format produced by the file compression program "gzip" (GNU zip) as described in RFC 1952 [25]. This format is a Lempel-Ziv coding (LZ77) with a 32 bit CRC.

    compress The encoding format produced by the common UNIX file compression program "compress". This format is an adaptive Lempel-Ziv-Welch coding (LZW).

        Use of program names for the identification of encoding formats
        is not desirable and is discouraged for future encodings. Their
        use here is representative of historical practice, not good
        design. For compatibility with previous implementations of HTTP,
        applications SHOULD consider "x-gzip" and "x-compress" to be
        equivalent to "gzip" and "compress" respectively.
    

    deflate The "zlib" format defined in RFC 1950 [31] in combination with the "deflate" compression mechanism described in RFC 1951 [29].

    identity The default (identity) encoding; the use of no transformation whatsoever. This content-coding is used only in the Accept-Encoding header, and SHOULD NOT be used in the Content-Encoding header.
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5

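    In other words, an HTTP content encoding has nothing to do with ASCII vs. UTF-8 vs. Latin-1. It describes a transformation such as compression that was applied to the body. The character encoding belongs in the charset parameter of the Content-Type header (for example, Content-Type: text/csv; charset=UTF-8), so a server that sends Content-Encoding: UTF-8 is misusing the header.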
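    The two layers can be seen separately in plain Ruby. This is only a sketch using the standard zlib library, with no HTTP involved; the constant SAMPLE_CSV and the two helper methods are names invented for the illustration. Removing the content coding yields bytes, and only then does the charset say how to read those bytes as text.

```ruby
require 'zlib'

# Sample CSV text containing non-ASCII characters (written with escapes).
SAMPLE_CSV = "name,city\nJos\u00E9,Z\u00FCrich\n"

# Layer 1: undo the content coding (what "Content-Encoding: gzip" asks for).
def remove_content_coding(wire_bytes)
  Zlib.gunzip(wire_bytes)
end

# Layer 2: interpret the resulting bytes using the declared charset
# (what "Content-Type: text/csv; charset=UTF-8" would tell the client).
def apply_charset(bytes, charset)
  bytes.dup.force_encoding(charset)
end

wire  = Zlib.gzip(SAMPLE_CSV)             # what a compressing server would send
bytes = remove_content_coding(wire)       # decompressed body bytes
text  = apply_charset(bytes, 'UTF-8')     # bytes labeled as UTF-8 text

puts text.valid_encoding?
puts text == SAMPLE_CSV
```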

    In addition the source code for Mechanize::HTTP::Agent has this in it:

      # A list of hooks to call after retrieving a response.  Hooks are called with
      # the agent and the response returned.
      attr_reader :post_connect_hooks
    
      # A list of hooks to call before making a request.  Hooks are called with
      # the agent and the request to be performed.
      attr_reader :pre_connect_hooks
    
      # A list of hooks to call to handle the content-encoding of a request.
      attr_reader :content_encoding_hooks
    

    So it doesn't even look like you are calling the right hook.

    Here is an example I got to work:

    require 'mechanize'
    
    a = Mechanize.new
    
    p a.content_encoding_hooks
    
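    func = lambda do |agent, uri, resp, body_io|
      puts body_io.read   # debug output only; reading consumes the IO
      body_io.rewind      # rewind so later reads do not see an empty body
      puts "The Content-Encoding is: #{resp['Content-Encoding']}"
    
      # 'UTF-8' is a charset, not a content coding, so mark the body as uncoded
      if resp['Content-Encoding'].to_s == 'UTF-8'
        resp['Content-Encoding'] = 'none'
      end
    
      puts "The Content-Encoding is now: #{resp['Content-Encoding']}"
    end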
    
    a.content_encoding_hooks << func
    
    a.get(
      'http://localhost:8080/cgi-bin/myprog.rb',
      [],
      nil,
      "Accept-Encoding" => 'gzip, deflate'  #This is what Firefox always uses
    )
    

    myprog.rb:

    #!/usr/bin/env ruby
    
    require 'cgi'
    
    cgi = CGI.new('html3')
    
    headers = {
      "type" => 'text/html',
      "Content-Encoding" => "UTF-8",
    }
    
    cgi.out(headers) do
      cgi.html() do
        cgi.head{ cgi.title{"Content-Encoding Test"} } +
        cgi.body() do
          cgi.div(){ "The Accept-Encoding was: #{cgi.accept_encoding}" }
        end
      end
    end
    
    --output:--
    []
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><HTML><HEAD><TITLE>Content-Encoding Test</TITLE></HEAD><BODY><DIV>The Accept-Encoding was: gzip, deflate</DIV></BODY></HTML>
    The Content-Encoding is: UTF-8
    The Content-Encoding is now: none
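
    If you want to check the hook logic without a server, you can call the lambda directly with stand-in objects. The sketch below is an assumption-laden illustration: CHARSET_LIKE and STRIP_CHARSET_HOOK are names I made up, a plain Hash stands in for the response (it only needs [] and []=), and a StringIO stands in for the body. The four-argument shape is the same one used in the example above. This variant compares case-insensitively and never reads body_io, so the body is left unconsumed.

```ruby
require 'stringio'

# Hypothetical list of charset names that a misconfigured server might put
# into Content-Encoding; extend it to match what you actually see.
CHARSET_LIKE = %w[utf-8 utf8 iso-8859-1 us-ascii].freeze

# Same four-argument shape as the hook in the example above. It only
# rewrites the header; real content codings such as gzip are left alone.
STRIP_CHARSET_HOOK = lambda do |_agent, _uri, response, _body_io|
  value = response['Content-Encoding'].to_s.strip.downcase
  response['Content-Encoding'] = 'none' if CHARSET_LIKE.include?(value)
end

# Stand-ins for illustration only, not Mechanize objects.
fake_response = { 'Content-Encoding' => 'UTF-8' }
fake_body     = StringIO.new("a,b\n1,2\n")

STRIP_CHARSET_HOOK.call(nil, nil, fake_response, fake_body)

puts "Content-Encoding after hook: #{fake_response['Content-Encoding']}"
puts "Body still readable: #{fake_body.read.inspect}"
```

    With a real agent, the same lambda would be registered with a.content_encoding_hooks << STRIP_CHARSET_HOOK, as in the example above.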