I have a requirement to read a series of URLs from a text file and then retrieve the pages and output a list of links.
The code has issues whenever the input URLs contain fragment identifiers (#). I tried escaping these with %23, but this didn't seem to help.
The error comes from OpenURI and is a 404 (Not Found).
#requirements
require 'nokogiri'
require 'open-uri'
#opening each line in input text file
line_num = 0
text = File.open('input.txt').read
text.gsub!(/\r\n?/, "\n")
text.each_line do |line|
  print "#{line_num += 1} #{line}"
  open('output.txt', 'a') { |f|
    f.puts "#{line_num} #{line}"
  }
  uri = URI.parse(URI.encode(line.strip))
  page = Nokogiri::HTML(open(uri))
  links = page.css("div.product-carousel-container a")
  #loop through links if present
  e = 0
  while e < links.length
    open('output.txt', 'a') { |f|
      f.puts links[e]["href"]
    }
    e += 1
  end
end
The fragment part of a URI should not be sent to the server.
From Wikipedia: Fragment Identifier
The fragment identifier functions differently than the rest of the URI: namely, its processing is exclusively client-side with no participation from the web server — of course the server typically helps to determine the MIME type, and the MIME type determines the processing of fragments. When an agent (such as a Web browser) requests a web resource from a Web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.
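As a small illustration of the quoted behaviour, Ruby's standard URI library keeps the fragment separate from the part that an HTTP client puts on the request line. The URL here is a made-up example, not one from the question.

```ruby
require "uri"

# Hypothetical URL, used only for illustration.
u = URI.parse("http://example.com/page?x=1#section")

u.fragment     # client-side part: "section"
u.request_uri  # path and query for the HTTP request line: "/page?x=1"
```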
Strip the fragment part of the URI before passing it to open.
require "uri"
u = URI.parse "http://example.com#fragment"
u.fragment = nil
u.to_s #=> "http://example.com"
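A minimal sketch of how this might fit into the loop from the question. The helper name strip_fragment is hypothetical, and the sketch assumes each input line is already a valid URI apart from the fragment (URI.parse raises URI::InvalidURIError otherwise). It also avoids URI.encode, which was removed in Ruby 3.0.

```ruby
require "uri"

# Hypothetical helper: returns the URL as a string with any fragment removed.
# Assumes the input is a parseable URI once surrounding whitespace is stripped.
def strip_fragment(url)
  uri = URI.parse(url.strip)
  uri.fragment = nil
  uri.to_s
end

strip_fragment("http://example.com/products#reviews\n") #=> "http://example.com/products"

# In the question's loop, the URI.encode line could then be replaced with
# something like this (not run here, it needs nokogiri, open-uri and network access):
#   page = Nokogiri::HTML(URI.open(strip_fragment(line)))
```

On Ruby versions before 3.0, open-uri also allowed the plain open call used in the question; URI.open is the form that still works on current versions.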