I have written a web scraper in Ruby, but the websites I am scraping have changed their design, so my scraper is failing. Is there a smart and simple solution to this inherent problem of scrapers (e.g. some kind of pattern matching, XPaths, comparing DOM trees, etc.)?
require 'eventmachine'
require 'em-http-request'
require 'nokogiri'

EM.run {
  # url and opts are assumed to be defined earlier
  http_request = EM::HttpRequest.new(url, opts).get
  http_request.callback { |http|
    # Parse the response body once, as HTML
    doc = Nokogiri::HTML(http.response)
    puts doc.css(".poster_information")
    puts doc.css(".date")
    puts doc.css(".comment_block")
  }
}
In the code snippet above, I am scraping a single page of the website for the poster information, date posted, and comments posted, using CSS selectors. Now suppose the webmaster changes the layout of the forum. The CSS selectors will stop matching, and my whole scraper will fail. I do not want to update my scraper every time the website's layout changes. Is there any way for my scraper to detect a layout change and still find the correct path to the desired elements? I have no way of knowing when the website will change. I am just trying to make my scraper automated and fault tolerant.
You can write integration tests that run periodically and notify you when the pages change. If the page structure changes frequently, I would also extract the selector patterns into a config, and maybe build a UI to easily edit which selectors I actually want to scrape. As a side note, you might also be interested in checking out Capybara to control the scraper at a higher level. capybara-webkit is available if you need JavaScript capabilities as well.
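As a rough sketch of the config idea, assuming the three selectors from the question: the selectors live in one hash (it could just as well be loaded from a YAML file), and a small check reports which of them no longer match anything. The names `SELECTORS`, `missing_fields`, and `parse_html` are made up for illustration; only `Nokogiri::HTML` and `css` are real library calls.

```ruby
# Hypothetical selector config; in practice this could be loaded from a file.
SELECTORS = {
  poster:  ".poster_information",
  date:    ".date",
  comment: ".comment_block"
}.freeze

# Returns the names of the fields whose selector matched nothing in `doc`.
# `doc` is anything that responds to `css`, e.g. a Nokogiri HTML document.
def missing_fields(doc, selectors = SELECTORS)
  selectors.select { |_name, css| doc.css(css).empty? }.keys
end

# Parses an HTML string with Nokogiri. The gem is required lazily so the
# check above does not depend on it.
def parse_html(html)
  require "nokogiri"
  Nokogiri::HTML(html)
end

# Possible use inside the scraper callback:
#   missing = missing_fields(parse_html(http.response))
#   warn "Layout may have changed, no match for: #{missing.join(', ')}" unless missing.empty?
```

This does not repair the selectors automatically. It only tells you which ones stopped matching, which is the signal a periodic test run would use to notify you.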