Tags: ruby, nokogiri, net-http

Wait for a selector to be present


When scraping with Nokogiri, I occasionally get the following error message:

 undefined method `at_css' for nil:NilClass (NoMethodError)

I know that the selected element is present at some point, but the site is sometimes slow to respond, and I suspect that is why I'm getting the error.

Is there some way to wait until a certain selector is present before proceeding with the script?

My current HTTP request block looks like this:

require "net/http"
require "nokogiri"

url = URL   # placeholder for the target URL
body = BODY # placeholder for the form-encoded request body
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 200 # default 60 seconds
http.open_timeout = 200 # default 60 seconds since Ruby 2.3 (nil before)
http.use_ssl = true
request = Net::HTTP::Post.new(uri.request_uri)
request.body = body
request["Content-Type"] = "application/x-www-form-urlencoded"
begin
  response = http.request(request)
  doc = Nokogiri::HTML(response.body)
rescue # any StandardError: wait, then retry indefinitely
  sleep 100
  retry
end

Solution

  • While you can use a streaming Net::HTTP request, as @Stefan says in his comment, together with a handler that feeds Nokogiri, you can't parse a partial HTML document using a DOM model, which is Nokogiri's default, because the DOM parser expects the full document.

    You could use Nokogiri's SAX parser, but that's an entirely different programming style.
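
    As a rough illustration of that different style, here is a minimal SAX sketch. The LinkCollector class and the sample markup are made up for the example. The handler reacts to events as the parser streams through the input, instead of building a tree that you can query with at_css.

    ```ruby
    require "nokogiri"

    # Hypothetical handler: collects the href of every <a> tag it sees.
    class LinkCollector < Nokogiri::XML::SAX::Document
      attr_reader :links

      def initialize
        super
        @links = []
      end

      # Called once per opening tag; attrs is an array of [name, value] pairs.
      def start_element(name, attrs = [])
        return unless name == "a"

        href = attrs.to_h["href"]
        @links << href if href
      end
    end

    # Sample markup, assumed for the example only.
    html = '<html><body><a href="/one">One</a><p>text</p><a href="/two">Two</a></body></html>'

    handler = LinkCollector.new
    Nokogiri::HTML::SAX::Parser.new(handler).parse(html)
    p handler.links
    ```

    There is no document object to search afterwards, so any state you need has to be tracked inside the handler.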

    If you're retrieving an entire page, use OpenURI instead of the lower-level Net::HTTP. It automatically handles a number of things that Net::HTTP does not do by default, such as following redirects, which makes retrieving pages much easier and greatly simplifies your code.
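
    A minimal sketch of that approach, assuming Ruby 2.5 or later (for URI.open) and a page that can be fetched with a plain GET. OpenURI only issues GET requests, so a form POST like the one in the question would still need Net::HTTP. The fetch_doc name and the 200 second timeout are illustrative choices, not anything OpenURI requires.

    ```ruby
    require "open-uri"
    require "nokogiri"

    # Hypothetical helper: fetches a page with OpenURI and parses it.
    # Redirects are followed by OpenURI itself.
    def fetch_doc(url, timeout: 200)
      html = URI.open(url, read_timeout: timeout, open_timeout: timeout, &:read)
      Nokogiri::HTML(html)
    end
    ```

    Calling fetch_doc("https://example.com/") would return a Nokogiri document. HTTP error statuses surface as OpenURI::HTTPError, so they can be rescued separately from timeouts.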

    I suspect the problem is either that the site is timing out, or that the tag you're trying to find is loaded dynamically after the initial page loads.

    If it's timing out, you'll need to increase your wait time.
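
    One way to do that without looping forever is to raise the timeouts and cap the number of retries. A minimal sketch, assuming a hypothetical with_retries helper (it is not part of Net::HTTP) and arbitrary attempt and wait values:

    ```ruby
    require "net/http"

    # Hypothetical helper: runs the block and retries only on timeouts,
    # giving up after a fixed number of attempts.
    def with_retries(attempts: 3, wait: 5)
      tries = 0
      begin
        tries += 1
        yield
      rescue Net::OpenTimeout, Net::ReadTimeout
        raise if tries >= attempts # out of attempts: re-raise the timeout

        sleep wait
        retry
      end
    end

    # Usage with the request from the question (not executed here):
    #   response = with_retries(attempts: 5, wait: 10) { http.request(request) }
    #   doc = Nokogiri::HTML(response.body)
    ```

    Rescuing only the timeout classes keeps other failures, such as the NoMethodError on nil, visible instead of silently retrying them.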

    If the markup is loaded dynamically, you can request the main page, locate the URL of the dynamic content, and load it separately. Once you have it, you can either insert it into the first page if you need everything, or just parse it on its own.
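
    The steps above could be sketched as follows. Everything site-specific here is an assumption: the #results container, its data-url attribute, and the sample markup are invented, and a real site may expose the secondary URL in a script tag or an XHR call instead. The helpers also show the nil check that avoids the NoMethodError from the question, since at_css returns nil when nothing matches.

    ```ruby
    require "nokogiri"
    require "uri"

    # Hypothetical: find the URL of the dynamically loaded content in the main page.
    def dynamic_content_url(doc, base_url, selector = "#results[data-url]")
      node = doc.at_css(selector)
      return nil if node.nil? # selector not present: let the caller decide what to do

      URI.join(base_url, node["data-url"]).to_s
    end

    # Hypothetical: insert separately fetched markup into the first page.
    def merge_fragment(doc, selector, fragment_html)
      container = doc.at_css(selector)
      return doc if container.nil?

      container.inner_html = fragment_html
      doc
    end

    # Sample data standing in for the two HTTP responses.
    main_html = '<html><body><div id="results" data-url="/ajax/results?page=1"></div></body></html>'
    fragment_html = '<span class="price">42</span>'

    page = Nokogiri::HTML(main_html)
    puts dynamic_content_url(page, "https://example.com/search")
    merge_fragment(page, "#results", fragment_html)
    puts page.at_css("#results .price").text
    ```

    In a real script, fragment_html would come from a second request to the URL returned by dynamic_content_url.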