Tags: web-crawler, screen-scraping, monitoring

How to protect/monitor your site against crawling by malicious users


Situation:

  • Site with content protected by username/password (not all accounts are under your control, since some are trial/test users)
  • A normal search engine can't get at the content because of the username/password restrictions
  • A malicious user can still log in and pass the session cookie to a "wget -r" or something else.

The question is: what is the best way to monitor such activity and respond to it (given that the site policy is that no crawling/scraping is allowed)?

I can think of some options:

  1. Set up some traffic monitoring solution to limit the number of requests for a given user/IP.
  2. Related to the first point: automatically block some user-agents.
  3. (Evil :)) Set up a hidden link that, when accessed, logs out the user and disables their account. (Presumably a normal user would never access it, since they wouldn't see it to click on it, but a bot will crawl all links.)

For point 1, do you know of a good, already-implemented solution? Any experiences with it? One problem is that some false positives might show up for very active but human users.
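To make point 1 concrete, here is a minimal sliding-window rate limiter in Python. It is only a sketch: the threshold, window length, and the choice of keying on username (or IP) are assumptions you would tune to your own traffic, and in practice you might prefer something at the proxy level (e.g. nginx's limit_req module or fail2ban) or counters backed by a shared store so the limit holds across processes.

```python
import time
from collections import defaultdict, deque

# Assumed limits - tune for your own traffic patterns.
MAX_REQUESTS = 120      # requests allowed per window
WINDOW_SECONDS = 60     # window length in seconds

_history = defaultdict(deque)   # key (username or IP) -> timestamps of recent requests

def is_rate_limited(key: str) -> bool:
    """Return True if `key` has exceeded the request budget for the current window."""
    now = time.time()
    q = _history[key]
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    q.append(now)
    return len(q) > MAX_REQUESTS
```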

For point 3: do you think this is really evil? Or do you see any possible problems with it?
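For what it's worth, the trap in point 3 is only a few lines in most web frameworks. Below is a hypothetical Flask sketch; the route path and the disable_account helper are placeholders, and the link itself would be hidden in the page markup (e.g. an anchor styled with display:none) so a human never sees it.

```python
from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "change-me"        # required for Flask sessions

def disable_account(user_id):
    """Placeholder: mark the account as disabled in your user store."""
    ...

@app.route("/do-not-follow")        # every page carries an invisible link to this URL
def crawler_trap():
    user_id = session.get("user_id")
    if user_id is not None:
        disable_account(user_id)    # or just notify the site owners instead
    session.clear()                 # log the session out either way
    abort(403)
```

Whether the trap disables the account outright or merely reports the hit is a policy decision; the code is the same apart from that one call.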

I'm also open to other suggestions.


Solution

  • Point 1 has the problem you mentioned yourself. It also doesn't help against a slower crawl of the site, and if you tighten the limits enough to catch one, it may hurt legitimate heavy users even more.

    You could turn point 2 around and only allow the user-agents you trust. Of course this won't help against a tool that fakes a standard user-agent.

    A variation on point 3 would be to just send a notification to the site owners; they can then decide what to do with that user.

    Similarly, for my variation on point 2, you could make this a softer action and just send a notification that somebody is accessing the site with a weird user agent.
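    A rough sketch of that softer check, assuming the User-Agent header and the authenticated username are available wherever you hook this in; the trusted prefixes and the logging-based notification are placeholders for whatever alerting the site owners actually use.

```python
import logging

log = logging.getLogger("ua-monitor")

# Assumed allowlist: user-agent prefixes you expect from real browsers.
TRUSTED_UA_PREFIXES = ("Mozilla/", "Opera/")

def check_user_agent(user_agent: str, username: str) -> None:
    """Soft action: don't block, just flag unexpected user agents for review."""
    if not user_agent or not user_agent.startswith(TRUSTED_UA_PREFIXES):
        # Could e-mail the site owners here instead of (or as well as) logging.
        log.warning("Unusual user agent %r for user %r", user_agent, username)
```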

    edit: Related to this, I once had a weird issue when I was accessing a URL of my own that was not public (I was just staging a site that I hadn't announced or linked anywhere). Although nobody but me should even have known this URL, all of a sudden I noticed hits in the logs. When I tracked it down, I saw the hits came from a content-filtering site: my mobile ISP used a third party to block content, and it intercepted my own requests. Since the filter didn't know the site, it fetched the page I was trying to access and (I assume) did some keyword analysis to decide whether or not to block it. This kind of thing might be an edge case you need to watch out for.