ruby-on-rails ruby screen-scraping

What would be the best approach to scrape multiple websites using Nokogiri?


With Nokogiri, you have to specify CSS selectors (for example, class names) to fetch data with at_css. How can the same problem be approached when the requirement is to scrape multiple websites whose design and CSS classes vary?


Solution

  • If your sites and scraping goals are similar enough, you can handle this by maintaining some data (in code or in a database) about each target site, including the paths (such as CSS selectors) to the relevant data.

    If the pages are radically different, then you usually have no choice but to write bespoke code for each page.

    You can combine the strategies: part of each site's data designates which code to use (in Ruby, which scraper Class or Module to invoke), and the remainder specifies suitable parameters for that code.

    Usually the strategies and code evolve over time; it is unlikely that you will start with a full understanding of how to scrape all targets. Constant refactoring is a good development model here if one goal is a maintainable codebase.
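
As a rough illustration of the combined strategy described above, here is a minimal sketch. Everything named in it is an assumption made for the example: the site names, the selectors, the `SITES` and `SCRAPERS` tables, and the `SelectorScraper`, `BlogScraper`, `scrape`, and `parse_html` helpers are hypothetical, and the per-site data could just as well live in a database or a YAML file. Only `Nokogiri::HTML`, `at_css`, `text`, and attribute lookup with `[]` are Nokogiri calls.

```ruby
# Hypothetical per-site data: which scraper to use, plus its parameters.
# Site names and selectors are invented for illustration.
SITES = {
  "example-shop" => {
    scraper: :selector,
    selectors: { title: "h1.product-title", price: "span.price" }
  },
  "example-blog" => {
    scraper: :blog,
    selectors: { title: "article header h2" }
  }
}.freeze

# Generic scraper: driven entirely by the selectors stored for the site.
class SelectorScraper
  def initialize(selectors)
    @selectors = selectors
  end

  # doc is a parsed Nokogiri document (anything that responds to at_css).
  def call(doc)
    @selectors.transform_values do |css|
      node = doc.at_css(css) # nil when the selector matches nothing
      node && node.text.strip
    end
  end
end

# Bespoke scraper for a site that needs logic beyond plain selectors.
class BlogScraper < SelectorScraper
  def call(doc)
    # Assumption: this hypothetical blog keeps the author in a meta tag.
    meta = doc.at_css('meta[name="author"]')
    super.merge(author: meta && meta["content"])
  end
end

# Maps the designation stored in the site data to a scraper class.
SCRAPERS = { selector: SelectorScraper, blog: BlogScraper }.freeze

# Looks up the site's data, picks the designated scraper, and runs it.
def scrape(site_name, doc)
  config = SITES.fetch(site_name)
  SCRAPERS.fetch(config[:scraper]).new(config[:selectors]).call(doc)
end

# Parses raw HTML; Nokogiri is loaded only when this helper is called.
def parse_html(html)
  require "nokogiri"
  Nokogiri::HTML(html)
end
```

With this layout, `scrape("example-shop", parse_html(html))` returns a hash keyed by the configured field names, with `nil` for any selector that matched nothing. Supporting a new site that resembles an existing one means adding an entry to the data; only a radically different site needs a new class.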