Tags: web-scraping, web-crawler, workflow, raw-data

web scraping design - best practice


I have implemented a few web scraping projects in the past - ranging from small to mid size (around 100,000 scraped pages). Usually my starting point is an index page that links to several pages with the details I want to scrape. Most of the time my projects worked in the end, but I always feel like I could improve the workflow, especially regarding the challenge of reducing the traffic I cause on the scraped websites (and, connected to that, the risk of being banned :D).

That's why I was wondering about your (best practice) approaches to web scraper design (for small and mid size projects).

Usually I build my web scraping projects like this:

  1. I identify a starting point, which contains the URLs I want to scrape data from. The starting point has quite a predictable structure, which makes it easy to scrape

  2. I take a glimpse at the endpoints I want to scrape and figure out some functions to scrape and process the data

  3. I collect all the URLs (endpoints) I want to scrape from my starting point and store them in a list (sometimes the starting point spans several pages - for example when search results are displayed and one page only shows 20 results - but the structure of these pages is almost identical)

  4. I start crawling the url_list and scrape the data I am interested in.

  5. To scrape the data, I run some functions to structure and store the data in the format I need

  6. Once I have successfully scraped the data, I mark the URL as "scraped" (if I run into errors, timeouts or something similar, I don't have to start from the beginning, but can continue from where the process stopped)

  7. I combine all the data I need and finish the project (a rough sketch of this whole workflow follows below)
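
A rough sketch of what steps 1-7 could look like in code, assuming requests + BeautifulSoup and a small SQLite table for the "scraped" flag - the selectors, the extraction in parse_detail() and the output file are placeholders, not a fixed recipe:

```python
# Sketch of the workflow above (steps 1-7). The selectors, parse_detail()
# and the output file are placeholders; SQLite is only used here to keep
# the "scraped" flag persistent between runs.
import json
import sqlite3
import requests
from bs4 import BeautifulSoup

db = sqlite3.connect("scrape_state.db")
db.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, scraped INTEGER DEFAULT 0)")

def collect_urls(index_url):
    """Step 3: walk the (paginated) index pages and store every detail URL."""
    while index_url:
        soup = BeautifulSoup(requests.get(index_url, timeout=30).text, "html.parser")
        for link in soup.select("a.detail-link"):             # placeholder selector, assumes absolute hrefs
            db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (link["href"],))
        next_link = soup.select_one("a.next-page")            # placeholder selector
        index_url = next_link["href"] if next_link else None
    db.commit()

def parse_detail(html):
    """Step 5: extract and structure the fields of interest (placeholder)."""
    soup = BeautifulSoup(html, "html.parser")
    return {"title": soup.select_one("h1").get_text(strip=True)}

def crawl():
    """Steps 4-6: scrape every pending URL, store the record, mark the URL as scraped."""
    pending = [row[0] for row in db.execute("SELECT url FROM urls WHERE scraped = 0")]
    for url in pending:
        try:
            record = parse_detail(requests.get(url, timeout=30).text)
            with open("results.jsonl", "a", encoding="utf-8") as out:
                out.write(json.dumps(record) + "\n")
            db.execute("UPDATE urls SET scraped = 1 WHERE url = ?", (url,))
            db.commit()
        except Exception as exc:                              # step 6: leave it pending for a retry
            print(f"failed on {url}: {exc}")
```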

Now I am wondering if it could be a good idea to modify this workflow and stop extracting/processing data while crawling. Instead I would collect the raw data/the HTML of the website, mark the URL as crawled and continue crawling. When all websites are downloaded (or - for a bigger project - between bigger tasks), I would run functions to process and store the raw data.
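
A minimal sketch of that two-phase split, assuming the raw HTML is written to disk under a hash of the URL - the paths and the extraction in process_all() are placeholders:

```python
# Phase 1 only downloads and stores raw HTML; phase 2 parses the stored
# files offline. File layout, hashing scheme and the extraction logic are
# assumptions for illustration, not a fixed design.
import hashlib
import pathlib
import requests
from bs4 import BeautifulSoup

RAW_DIR = pathlib.Path("raw_pages")
RAW_DIR.mkdir(exist_ok=True)

def raw_path(url):
    return RAW_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".html")

def crawl_only(urls):
    """Phase 1: fetch each page exactly once and store the raw HTML."""
    for url in urls:
        path = raw_path(url)
        if path.exists():                                     # already crawled -> no extra traffic
            continue
        path.write_text(requests.get(url, timeout=30).text, encoding="utf-8")

def process_all():
    """Phase 2: parse offline; re-runs only touch local files, never the website."""
    for path in RAW_DIR.glob("*.html"):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        yield {"title": soup.select_one("h1").get_text(strip=True)}   # placeholder extraction
```

If the extraction code turns out to be wrong, only process_all() has to be changed and re-run - crawl_only() never fetches the same URL twice.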

Benefits of this approach would be:

  • if I run into errors caused by unexpected structure, I would not have to re-scrape all the pages scraped before; I would only have to change my code and run it on the stored raw data (which would minimize the traffic I cause)
  • as websites keep changing, I would have a pool of reproducible data

Cons would be:

  • especially if projects grow in size, this approach could require too much space (see the compression sketch below)
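
One idea to soften the space problem would be storing the dumps gzip-compressed - HTML is repetitive text and usually compresses to a fraction of its size. A minimal sketch:

```python
# Store raw pages as .html.gz instead of plain .html; gzip is in the
# standard library and HTML typically compresses very well.
import gzip

def save_page(path, html):
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(html)

def load_page(path):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read()

save_page("example.html.gz", "<html>...</html>")
print(load_page("example.html.gz"))
```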

Solution

  • Without knowing your goal, it's hard to say, but I think it's a good idea as far as debugging goes.

    For example, if the sole purpose of your scraper is to record some product's price, but your scraper suddenly fails to obtain that data, then yes - it would make sense to kill the scraper.

    But let's say the goal isn't just the price, but various attributes on a page, and the scraper is just failing to pick up one attribute due to something like a website change. If that is the case, and there is still value in scraping the other data attributes, then I would continue scraping, but log the error. Another consideration is the failure rate. Web scraping is very finicky - sometimes web pages load differently or incompletely, and sometimes websites change. Is the scraper failing 100% of the time, or perhaps only 5% of the time?
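
    A hedged sketch of what "continue scraping, but log the error" could look like per attribute, together with a simple failure-rate counter - the attribute names and selectors are made up for illustration:

```python
# Per-attribute error handling: a page is still worth keeping when only one
# attribute fails, so every extraction is wrapped individually and failures
# are counted per attribute. Attribute names/selectors are illustrative only.
import logging
from collections import Counter
from bs4 import BeautifulSoup

log = logging.getLogger("scraper")
failures = Counter()
pages_seen = 0

ATTRIBUTE_SELECTORS = {              # illustrative mapping: attribute -> CSS selector
    "title": "h1.product-title",
    "price": "span.price",
    "rating": "div.rating",
}

def extract_attributes(url, html):
    global pages_seen
    pages_seen += 1
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for name, selector in ATTRIBUTE_SELECTORS.items():
        node = soup.select_one(selector)
        if node is None:
            failures[name] += 1                      # count, log, and keep going
            log.warning("missing %s on %s", name, url)
            record[name] = None
        else:
            record[name] = node.get_text(strip=True)
    return record
```

    After a run, failures["price"] / pages_seen shows whether the price extraction fails 100% of the time (the site probably changed) or only occasionally (flaky page loads).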

    Having the HTML dump saved on error would certainly help debug issues like an XPath failing and such. You could minimize the amount of space consumed by more careful error handling: for example, save a file containing an HTML dump only if one doesn't already exist for this specific error - say, an XPath failing to return a value, a type mismatch, etc.
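
    One possible shape for "save a dump only if one doesn't already exist for this specific error" - the error keys are whatever identifies the failure (the failing XPath/selector, an exception class, a field name) and are purely illustrative here:

```python
# Keep at most one HTML dump per distinct error type, so the debug corpus
# stays small while every kind of failure still has a sample page to inspect.
import pathlib
import re

DUMP_DIR = pathlib.Path("error_dumps")
DUMP_DIR.mkdir(exist_ok=True)

def dump_on_error(error_key, html):
    """Write the page's HTML once per error_key, e.g. 'price_xpath_empty'."""
    safe_key = re.sub(r"[^A-Za-z0-9_-]", "_", error_key)
    path = DUMP_DIR / f"{safe_key}.html"
    if not path.exists():                          # only the first occurrence is kept
        path.write_text(html, encoding="utf-8")

# usage inside the extraction loop (sketch):
# if soup.select_one("span.price") is None:
#     dump_on_error("price_selector_empty", html)
```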

    Re: getting banned. I would recommend using a scraping framework. For example, in Python there is Scrapy, which handles the flow of requests (scheduling, delays, retries). Also, proxy services exist to avoid getting banned. In the US at least, scraping publicly accessible data has been deemed legal by the courts. All companies account for web scraping traffic; you aren't going to break a service with 100k scrapes. Think about the millions of scrapes a day Walmart does on Amazon, and vice versa.
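
    A minimal Scrapy spider sketch showing the settings that keep the request flow polite (robots.txt, AutoThrottle, delays, per-domain concurrency) - the domain, selectors and item fields are placeholders:

```python
# Minimal Scrapy spider sketch. The domain, URL and selectors are placeholders;
# the settings shown are the knobs Scrapy offers to keep request volume polite
# and reduce the ban risk.
import scrapy

class DetailSpider(scrapy.Spider):
    name = "details"
    start_urls = ["https://example.com/index"]          # placeholder starting point

    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "AUTOTHROTTLE_ENABLED": True,                   # adapts delay to server load
        "DOWNLOAD_DELAY": 1.0,                          # base delay between requests
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "RETRY_TIMES": 3,                               # retries instead of restarting the whole run
    }

    def parse(self, response):
        # walk the index and follow detail links (placeholder selectors)
        for href in response.css("a.detail-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        yield {"url": response.url,
               "title": response.css("h1::text").get()}
```

    Run it with `scrapy runspider spider.py -o results.jsonl`; proxy rotation can be added through Scrapy's downloader middleware settings if it ever becomes necessary.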