Tags: json, language-agnostic, api, screen-scraping

Will providing APIs help deter screen scraping?


I have been thinking quite a bit lately about screen scraping and how much work it can be, so I pose the following question.

Would you, as a site developer, expose simple APIs, such as JSON results, to keep users from screen scraping?

These results could then be cached, and they generate far less traffic than the large amounts of markup that would otherwise be downloaded.
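As a rough illustration of what such a cacheable JSON result could look like, here is a minimal sketch using only Python's standard library. Everything in it is an assumption made for the example: the `/users/<id>.json` path, the `PROFILES` data, the 60-second `max-age`, and the helper names are hypothetical and not taken from any real site.

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical profile data; a real site would read this from its data store.
PROFILES = {"1": {"id": 1, "name": "example-user", "reputation": 1234}}


def build_response(user_id, if_none_match=None):
    """Return (status, headers, body) for a profile request."""
    profile = PROFILES.get(user_id)
    if profile is None:
        return 404, {}, b""
    body = json.dumps(profile, sort_keys=True).encode("utf-8")
    # The ETag is derived from the body, so it changes only when the data does.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {"ETag": etag, "Cache-Control": "public, max-age=60"}
    if if_none_match == etag:
        # The client already holds this version: answer without a body.
        return 304, headers, b""
    headers["Content-Type"] = "application/json"
    return 200, headers, body


class ProfileHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Assumed URL shape: /users/1.json
        name = self.path.rsplit("/", 1)[-1]
        user_id = name[:-5] if name.endswith(".json") else name
        status, headers, body = build_response(
            user_id, self.headers.get("If-None-Match")
        )
        self.send_response(status)
        for key, value in headers.items():
            self.send_header(key, value)
        if status != 304:
            self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the demo server; call this manually to try the sketch."""
    HTTPServer(("127.0.0.1", port), ProfileHandler).serve_forever()
```

With `Cache-Control` set, clients and intermediate caches can reuse a response for a minute, and a polling client that sends `If-None-Match` receives an empty 304 instead of the full payload when nothing has changed.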

I am not looking to prevent scraping, only to deter it.


Scraping Bandwidth Sample
((users * (% / 100)) * ((freq * 60) * 24)) * filesize

  • users: 200,000
  • % of users using the utility: 5
  • filesize: 1 KB per request
  • freq: 1 request per minute

Substituting the values:

(200,000 * (5 / 100)) * ((1 * 60) * 24) * 1 KB

= 10,000 * 1,440 * 1 KB

= 14,400,000 KB, or about 13.73 GB a day

Assuming your JSON result is 200 bytes (0.2 KB), that becomes 10,000 * 1,440 * 0.2 = 2,880,000 KB, or about 2.75 GB a day.

That's a reduction of about 11 GB of traffic a day.
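The arithmetic above can be checked with a short script. This is only a sketch of the same formula; the function and parameter names are made up for the example, and, as in the figures above, 1 GB is taken to be 1024 * 1024 KB.

```python
KB_PER_GB = 1024 * 1024


def daily_traffic_kb(users, percent_using, requests_per_minute, payload_kb):
    """Estimated daily traffic in KB, following the formula above."""
    active_users = users * percent_using / 100
    requests_per_day = requests_per_minute * 60 * 24
    return active_users * requests_per_day * payload_kb


html_kb = daily_traffic_kb(200_000, 5, 1, 1)    # 1 KB of markup per request
json_kb = daily_traffic_kb(200_000, 5, 1, 0.2)  # 200-byte JSON result

print(f"markup: {html_kb / KB_PER_GB:.2f} GB/day")
print(f"json:   {json_kb / KB_PER_GB:.2f} GB/day")
print(f"saved:  {(html_kb - json_kb) / KB_PER_GB:.2f} GB/day")
```

Changing `payload_kb` to 96 for the markup case shows how quickly the gap grows when the scraped page is a full profile rather than a 1 KB file.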


For reference, my Stack Overflow profile page is 96 KB.


This question was prompted by a request for JSON results from user profiles:
http://stackoverflow.uservoice.com/pages/general/suggestions/101342-add-json-for-user-information

I wanted to find out if other developers would expose this type of API, and if it is worth your time to provide these APIs to reduce bandwidth.


Solution

  • Providing an API should definitely reduce the amount of screen scraping done against your site. Using a good REST API is much easier and safer than screen scraping. Page markup can change without notice, which makes screen-scraping code much harder to maintain. As a developer, if I needed information from a site, I would never scrape it if the same information were available through an API.
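To make the maintenance point concrete, here is a small sketch comparing the two approaches on made-up data. The markup in `OLD_PAGE` and `NEW_PAGE`, the JSON shape in `API_BODY`, and all class and function names are hypothetical; they only illustrate how a scraper tied to one page layout stops working after a redesign while a JSON consumer is unaffected.

```python
import json
from html.parser import HTMLParser


class ReputationScraper(HTMLParser):
    """Pulls the text of the first <span class="reputation"> element."""

    def __init__(self):
        super().__init__()
        self._capture = False
        self.reputation = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "reputation") in attrs:
            self._capture = True

    def handle_data(self, data):
        if self._capture and self.reputation is None:
            self.reputation = int(data.replace(",", ""))
            self._capture = False


def scrape_reputation(html):
    """Return the scraped reputation, or None if the markup did not match."""
    parser = ReputationScraper()
    parser.feed(html)
    return parser.reputation


def api_reputation(payload):
    """Return the reputation from a JSON API response body."""
    return json.loads(payload)["reputation"]


OLD_PAGE = '<div><span class="reputation">1,234</span></div>'
NEW_PAGE = '<div><b class="rep-score">1,234</b></div>'  # same data, new markup
API_BODY = '{"id": 1, "reputation": 1234}'

print(scrape_reputation(OLD_PAGE))  # the scraper matches the old layout
print(scrape_reputation(NEW_PAGE))  # None: the redesign broke the scraper
print(api_reputation(API_BODY))     # the API consumer does not depend on layout
```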