Tags: html, web-scraping, scrapy, web-crawler, jsoup

Web crawling for multiple websites with different structures


I want to crawl multiple websites with different structures to find specific data, and I have some keywords to help me find what I want. To be more clear: I want to extract a list of professors' names from a university's website and repeat this over a given list of universities. The keywords here can be the word "Professor", "Prof" or "Dr" before the name, and an email address after it. However, it's a bit challenging to deal with the different HTML structures that each website has.

What's your suggestion?


Solution

  • It depends.

    Option 1: If "multiple websites" means a handful, maybe up to ten, you could try building a separate scraper for each of them.

    Advantage: you get exact results and you get all results.

    Disadvantage: whenever a site changes, its scraper breaks and needs adjustment; this becomes too much work when there are 100s of sites or more.
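    A minimal sketch of the per-site approach: each known site gets its own extraction function keyed by domain. The domain, the CSS class and the HTML structure below are made-up examples, not real university pages.

```python
import re

# Hypothetical per-site scrapers: each site with a known, fixed HTML
# structure gets its own extraction function (domain and markup are
# invented for illustration).

def parse_example_university(html: str) -> list[str]:
    # This particular site is assumed to list staff as
    # <li class="staff">Name</li> entries.
    return re.findall(r'<li class="staff">([^<]+)</li>', html)

SCRAPERS = {
    "example-university.edu": parse_example_university,
    # "other-university.edu": parse_other_university, ...
}

sample = ('<ul><li class="staff">Prof. Ada Lovelace</li>'
          '<li class="staff">Dr. Alan Turing</li></ul>')
names = SCRAPERS["example-university.edu"](sample)
print(names)  # ['Prof. Ada Lovelace', 'Dr. Alan Turing']
```

    In a real setup you would fetch each page (e.g. with scrapy) and look up the parser by the page's domain; the point is that each function encodes one site's exact structure.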

    Option 2: If "multiple websites" means really many websites, building a scraper for each site is most probably too expensive. In this case the only other option I can think of is to build a generic crawler that crawls all sites and then run NLP algorithms on the results to extract the data you need.

    I gave an overview of how such an NLP-based processing pipeline might look in a recent, somewhat similar question: How to crawl thousands of pages using scrapy?

    Advantage: once it is running and fine-tuned, it doesn't matter whether there are 100s or 1000s of sites to process, and it is quite robust when sites change.

    Disadvantages: getting this up and running is more difficult than writing a scraper, and you will never get 100% of the results, nor will they be 100% accurate.
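    A rough sketch of the generic, keyword-driven extraction step (far simpler than a full NLP pipeline): after stripping the HTML from any crawled page, look for one of the questioner's title keywords ("Professor", "Prof", "Dr") followed by a capitalized name, and require an email address nearby to confirm the hit. The regexes and the window size are assumptions you would tune on real data.

```python
import re

# Title keyword followed by 1-3 capitalized name tokens.
TITLE_NAME = re.compile(
    r'\b(?:Professor|Prof\.?|Dr\.?)\s+((?:[A-Z][a-z]+\s?){1,3})'
)
# Loose email pattern.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def extract_candidates(text: str, window: int = 120) -> list[tuple[str, str]]:
    """Return (name, email) pairs where an email follows the name
    within `window` characters of plain page text."""
    results = []
    for m in TITLE_NAME.finditer(text):
        nearby = text[m.end():m.end() + window]
        email = EMAIL.search(nearby)
        if email:
            results.append((m.group(1).strip(), email.group()))
    return results

page_text = "Contact Prof. Jane Smith, jane.smith@uni.example, for admissions."
print(extract_candidates(page_text))
# [('Jane Smith', 'jane.smith@uni.example')]
```

    This works on any site regardless of its HTML structure, which is exactly why it can't be 100% accurate: it will miss unusual layouts and occasionally pair the wrong name with the wrong email.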

    Added in 2020/04 – Option 3: In some markets you'll see that a handful of specialized content management systems or site templates are very common; following the Pareto (80/20) rule, you can often cover 60-80% of all sites by implementing just a handful of specialized scrapers.

    Advantage: you get exact and complete results, and can still cover the majority of 100s or 1000s of seemingly different websites.

    Disadvantage: this works only when there are enough commonalities among most websites, which is usually the case when a small number of specialized (content management) systems are widespread in this "market".
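    The template approach can be sketched as a two-step dispatch: first detect which CMS or template a page uses, then hand it to the matching scraper. One common (but not universal) marker is the `<meta name="generator">` tag; the generator names checked below are examples.

```python
import re

def detect_template(html: str) -> str:
    """Guess the site template from the <meta name="generator"> tag.
    Returns a template key, or 'unknown' if no marker is found."""
    m = re.search(r'<meta\s+name="generator"\s+content="([^"]+)"', html, re.I)
    if m:
        generator = m.group(1).lower()
        if "wordpress" in generator:
            return "wordpress"
        if "typo3" in generator:
            return "typo3"
    return "unknown"

# One specialized scraper per detected template would then be
# registered in a dispatch table, analogous to Option 1 but keyed
# by template instead of by site.
html = '<head><meta name="generator" content="WordPress 5.4" /></head>'
print(detect_template(html))  # wordpress
```

    In practice you would combine several signals (URL paths, asset names, characteristic markup) rather than the generator tag alone, since many sites strip it.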