Tags: web-scraping, data-collection

Protection from Web Scraping


I am currently part of a team developing an application that includes a front-end client.

Through this client we serve data to users; each user has a user ID, and the client talks to our server through a RESTful API to request data.

For example, let's say we have a database of books, and the user can get the last 3 books an author wrote. We value our users' time and we would like users to be able to start using the product without explicit registration.

We value our database, we use our own proprietary software to populate it and would like to protect it as much as we can.

So basically the question is:

What can we do to protect ourselves from web scraping?

I would very much like to learn about techniques to protect our data. In particular, we would like to prevent users from typing every single author name into the author search panel and fetching the top three books each author wrote.

Any suggested reading would be appreciated.

I'd just like to mention that we're aware of captchas and would like to avoid them as much as possible.


Solution

  • The main strategies for preventing this are:

    • require registration, so you can limit the requests per user
    • captchas for registration and non-registered users
    • rate limiting for IPs
    • require JavaScript - writing a scraper that can execute JS is harder
    • robots blocking, and bot detection (e.g. request rates, hidden link traps)
    • data poisoning: insert books and links that nobody would want, which stall the download for bots that blindly collect everything
    • mutation: frequently change your templates, so that scrapers may fail to find the desired contents
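    The per-IP rate limiting mentioned above can be sketched as a sliding-window counter. This is a minimal illustration, not the answerer's implementation; the class name, limits, and window size are assumptions:

    ```python
    import time
    from collections import defaultdict, deque

    class RateLimiter:
        """Allow at most `limit` requests per `window` seconds per client IP."""

        def __init__(self, limit=10, window=60.0):
            self.limit = limit
            self.window = window
            self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

        def allow(self, ip, now=None):
            now = time.monotonic() if now is None else now
            q = self.hits[ip]
            # Drop timestamps that have fallen out of the window.
            while q and now - q[0] > self.window:
                q.popleft()
            if len(q) >= self.limit:
                return False  # over the limit: reject, delay, or serve a captcha
            q.append(now)
            return True

    limiter = RateLimiter(limit=3, window=60.0)
    results = [limiter.allow("203.0.113.5", now=t) for t in (0, 1, 2, 3)]
    # The first three requests within the window pass; the fourth is throttled.
    ```

    In production this state would typically live in a shared store (e.g. Redis) rather than process memory, so all API servers enforce the same limit.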

    Note that you can use captchas quite flexibly.

    For example: the first book each day for a given IP is served without a captcha, but in order to access a second book, a captcha needs to be solved.
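    That first-book-free scheme can be sketched as a per-IP daily quota. A minimal sketch under assumed names (`DailyQuota`, `needs_captcha`), not part of the original answer:

    ```python
    import datetime
    from collections import defaultdict

    class DailyQuota:
        """Serve the first `free` lookups per IP per day without a captcha;
        require a solved captcha for any lookup beyond that."""

        def __init__(self, free=1):
            self.free = free
            self.seen = defaultdict(int)  # (ip, date) -> lookups served today

        def needs_captcha(self, ip, today=None):
            today = today or datetime.date.today()
            return self.seen[(ip, today)] >= self.free

        def record_lookup(self, ip, today=None):
            today = today or datetime.date.today()
            self.seen[(ip, today)] += 1

    quota = DailyQuota(free=1)
    ip, day = "198.51.100.7", datetime.date(2024, 1, 1)
    first = quota.needs_captcha(ip, day)   # first book today: no captcha
    quota.record_lookup(ip, day)
    second = quota.needs_captcha(ip, day)  # second book: captcha required
    ```

    Because the counter is keyed on `(ip, date)`, the quota resets naturally each day without any cleanup job.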