Unexplained Inconsistency when Downloading an XLS file with Ruby Mechanize after redirect


I have a script that visits fcc.gov, then clicks a link which triggers a download:

require "mechanize"

docket_number = "12-268" # fails with "96-128"

url = "http://apps.fcc.gov/ecfs/comment_search/execute?proceeding=#{docket_number}"
agent = Mechanize.new
# Save any response without a dedicated parser (such as the XLS file) into ./downloads
agent.pluggable_parser.default = Mechanize::DirectorySaver.save_to 'downloads'

agent.get(url) do |page|
  # Find the export link by its visible text and follow it
  link = page.link_with(:text => "Export to Excel file")
  agent.click(link)
end

This works fine when docket_number is "12-268". But when I change it to "96-128", Mechanize downloads the HTML of the page instead of the desired spreadsheet.

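The URLs for both pages (the url from the script with each docket number substituted) are:

http://apps.fcc.gov/ecfs/comment_search/execute?proceeding=12-268
http://apps.fcc.gov/ecfs/comment_search/execute?proceeding=96-128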

If you visit each page in a browser (I'm using Chrome) and click "Export to Excel file", a spreadsheet file is downloaded and there is no problem. "96-128" has many more rows, so clicking the Export link first takes you to a new page that refreshes every 10 seconds or so until the file begins downloading. Why is there this inconsistency, and how can I get around it?


Solution

  • Clicking Export on 96-128 takes you to a page that refreshes itself using a meta refresh tag like this one (I've never heard of it before):

    <meta http-equiv="refresh" content="5;url=/ecfs/comment_search/export?exportType=xls"/>
    

    By default, Mechanize does not follow these refreshes, so it saves the HTML of the interim page instead of the spreadsheet. To get around that, enable this setting on agent before calling agent.get:

    agent.follow_meta_refresh = true
    

    Source: https://stackoverflow.com/a/2166480/94154
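
To make the tag above less mysterious, here is a rough sketch in plain Ruby of what "following" a meta refresh involves: read the delay and the target out of the content attribute, then resolve the target against the current page. This is not Mechanize's actual implementation; parse_meta_refresh is a hypothetical helper, and the base URL is an assumption (the refresh page may be served from a different address).

```ruby
require "uri"

# Hypothetical helper (not part of Mechanize): splits a meta refresh
# `content` value such as "5;url=/path" into the delay in seconds and
# the target URL (nil when the tag only reloads the current page).
def parse_meta_refresh(content)
  delay, target = content.split(";", 2)
  url = target && target.strip.sub(/\Aurl\s*=\s*/i, "")
  [delay.strip.to_f, url]
end

delay, path = parse_meta_refresh("5;url=/ecfs/comment_search/export?exportType=xls")

# Assumed base: the page that served the tag. The target is a relative
# path, so it has to be resolved against that page's URL.
base = "http://apps.fcc.gov/ecfs/comment_search/execute?proceeding=96-128"
next_url = URI.join(base, path).to_s

puts delay     # 5.0
puts next_url  # http://apps.fcc.gov/ecfs/comment_search/export?exportType=xls
```

With follow_meta_refresh enabled, Mechanize takes care of this for you, so the original script should only need that one extra line on agent.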