Tags: python, parsing, web-scraping, web-crawler, dom-traversal

Web scraping: finding element after a DOM Tree change


I am relatively new to web scraping/crawlers and was wondering about two issues that arise when a previously parsed DOM element can no longer be found in the fetched webpage:

1- Is there a clever way to detect if the page has changed? I have read that it's possible to store and compare hashes but I am not sure how effective it is.

2- If a parsed element is no longer found in the fetched webpage, but we know the same DOM element still exists somewhere else in the DOM tree, is there a way to traverse the tree efficiently without having to visit every node?

I am trying to find out how experienced developers deal with those two issues and would appreciate insights/hints/strategies on how to manage them.

Thank you in advance.


Solution

  • I didn't see this in your tag list, so I thought I'd mention it before anything else: BeautifulSoup, a Python library designed specifically for web scraping.

    Web scraping is a messy process. Unless there's some long-standing regularity or a direct relationship with the website, you can't really rely on anything in the page remaining static - certainly not when you scale to millions of web pages.

    With that in mind:

    1. There's no one-size-fits-all solution. Some ideas:
      • Use RSS, if available.
      • Split your scraping into crude categories, where some categories carry implied or explicit timestamps (e.g. news sites) that you can use to trigger an update on your end.
      • You already mentioned this, but hashing works quite well and is relatively cheap in terms of storage. Another idea is to hash not the entire page but only the dynamic parts or the elements of interest.
      • Send a HEAD request and compare response headers such as Last-Modified or ETag, if the server supports them.
      • Download and store the previous and current versions of the files, then compare them with a utility like diff.
      • Use a 3rd party service to detect a change and trigger a "refresh" on your end.
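
    The element-hashing idea above can be sketched in a few lines. This is a minimal example, not a full solution: the `element_hash` helper and the `#news` selector are hypothetical, and it assumes BeautifulSoup 4 is installed.

    ```python
    import hashlib

    from bs4 import BeautifulSoup

    def element_hash(page_html, css_selector):
        """Hash only the element of interest instead of the whole page."""
        soup = BeautifulSoup(page_html, "html.parser")
        node = soup.select_one(css_selector)
        if node is None:
            return None  # element vanished: treat as a change in itself
        return hashlib.sha256(str(node).encode("utf-8")).hexdigest()

    old = element_hash("<div id='news'>Headline A</div>", "#news")
    new = element_hash("<div id='news'>Headline B</div>", "#news")
    print(old != new)  # the stored and fresh hashes differ, so the element changed
    ```

    Storing one short hex digest per page (or per element) is far cheaper than storing full page snapshots, which is why this approach scales well.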

    Obviously each of the above has its pros and cons in terms of processing, storage, and memory requirements.

    2. As of version 4.x, BeautifulSoup lets you plug in different HTML parsers, notably lxml. Note that BeautifulSoup itself doesn't expose XPath, but using lxml directly gives you XPath queries, which will definitely be more efficient than traversing the entire tree manually in a loop.
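
    For instance, an XPath query can find an element by its attributes no matter where it has moved in the tree. A minimal sketch, assuming lxml is installed (the `id='price'` attribute is a made-up example):

    ```python
    from lxml import html

    doc = html.fromstring(
        "<html><body><section><div id='price'>42</div></section></body></html>"
    )
    # XPath matches on the id attribute, independent of the element's position
    nodes = doc.xpath("//div[@id='price']")
    print(nodes[0].text)  # -> 42
    ```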

    Alternatively (and likely even more efficiently), you can use CSS selectors. These are more flexible because they don't depend on the content sitting in the same place; of course, this assumes the content you're interested in retains the same CSS classes and attributes.
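
    BeautifulSoup supports CSS selectors directly via `select()` / `select_one()`. A quick sketch (the `span.price` class is a hypothetical example) showing that the element is still found after being moved elsewhere in the tree:

    ```python
    from bs4 import BeautifulSoup

    # Same element, two different positions in the DOM tree
    page_v1 = "<html><body><span class='price'>42</span></body></html>"
    page_v2 = "<html><body><main><aside><span class='price'>42</span></aside></main></body></html>"

    for page in (page_v1, page_v2):
        soup = BeautifulSoup(page, "html.parser")
        # The selector matches by class, regardless of where the node moved
        print(soup.select_one("span.price").text)  # -> 42 both times
    ```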

    Hope this helps!