web-scraping, scrapy, scrapy-shell, web-mining

Defensive web scraping techniques for Scrapy spiders


I have been web scraping for about 3 months now, and I have noticed that many of my spiders need constant babysitting because the websites keep changing. I use Scrapy, Python, and Crawlera to scrape my sites. For example, 2 weeks ago I created a spider and already had to rebuild it because the website changed its meta tags from singular to plural (so location became locations). Such a small change shouldn't be able to break my spiders, so I would like to take a more defensive approach to my collections moving forward. Does anyone have any advice for making scrapers need less babysitting? Thank you in advance!


Solution

  • Since you didn't post any code, I can only give general advice.

    1. Check whether there's a hidden API that returns the data you're looking for. Load the page in Chrome, open the developer tools with F12, and go to the Network tab. Press Ctrl+F to search the captured requests for the text you see on screen and want to collect. If any response under the Network tab contains the data as JSON, scraping that endpoint is more reliable, since the backend of a website tends to change less frequently than the frontend.

    2. Be less specific with selectors. Instead of body > .content > #datatable > .row::text, use #datatable > .row::text. The shorter selector depends on fewer ancestor elements, so your spider is less likely to break on small layout changes.

    3. Handle errors with try/except where you expect the data might be inconsistent, so that one bad field doesn't abort the whole parse function.
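
As a rough illustration of points 1 and 3, here is a minimal sketch that builds an item from a JSON payload while tolerating a renamed or malformed field. The helper names (first_value, safe_extract, parse_item) and the price field are hypothetical and only there for the example. The location/locations pair comes from the question.

```python
import json
import logging

logger = logging.getLogger(__name__)


def first_value(record, *keys, default=None):
    """Return the first non-empty value found under any of the candidate keys."""
    for key in keys:
        value = record.get(key)
        if value not in (None, "", [], {}):
            return value
    return default


def safe_extract(extractor, field, default=None):
    """Run one extraction step; log and return a default instead of raising."""
    try:
        return extractor()
    except (KeyError, IndexError, TypeError, ValueError, AttributeError) as exc:
        logger.warning("could not extract %r: %s", field, exc)
        return default


def parse_item(payload):
    """Build an item from a JSON string, tolerating renamed or missing fields."""
    record = json.loads(payload)
    return {
        # Accept both the old singular and the new plural field name.
        "location": first_value(record, "location", "locations"),
        # Hypothetical field: a malformed price should not discard the whole item.
        "price": safe_extract(lambda: float(record["price"]), "price"),
    }
```

In a Scrapy callback you could call something like parse_item(response.text) on the JSON endpoint you found in the Network tab. The same safe_extract wrapper would also work around a loose CSS lookup such as response.css("#datatable > .row::text").get(), so a single missing element gets logged instead of killing the callback.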