
What are some good methods to hinder screen scrapers from grabbing specific pieces of content off my site?


Pretty sure this question counts as blasphemy to most web 2.0 proponents, but I do think there are times when you might not want pieces of your site being easily ripped off into someone else's arbitrary web aggregator. At least enough that they'd have to be arsed to do it by hand if they really wanted it.

My idea was to make a script that positioned text nodes by absolute coordinates in the order they'd appear normally within their respective paragraphs, but then stored those text nodes in a random, jumbled up order in the DOM. Of course, getting a system like that to work properly (proper text wrap, alignment, styling, etc.) seems almost akin to writing my own document renderer from scratch.
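To make the idea concrete, here is a rough sketch of what I have in mind, under some heavy assumptions: a monospace font, a fixed line width in characters, and no styling at all. The names layoutWords, shuffle, and scrambledHtml are made up for illustration. It works out where each word would sit on a character grid, then writes the words out in shuffled order with absolute coordinates.

```javascript
// Hypothetical sketch, not a real layout engine: assumes a monospace font
// and a fixed number of characters per line.

// Work out the row/column each word would occupy with simple greedy wrapping.
function layoutWords(text, charsPerLine = 40) {
  const words = text.split(/\s+/).filter(Boolean);
  const placed = [];
  let row = 0;
  let col = 0;
  for (const word of words) {
    // Wrap to the next row if the word would overflow the current one.
    if (col > 0 && col + word.length > charsPerLine) {
      row += 1;
      col = 0;
    }
    placed.push({ word, row, col });
    col += word.length + 1; // one space between words
  }
  return placed;
}

// Fisher-Yates shuffle on a copy; `random` is injectable for testing.
function shuffle(items, random = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

function escapeHtml(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Emit the words in jumbled DOM order, each pinned to its visual position.
function scrambledHtml(text, charsPerLine = 40, random = Math.random) {
  const lineHeightEm = 1.5;
  const placed = layoutWords(text, charsPerLine);
  const rows = placed.length ? placed[placed.length - 1].row + 1 : 0;
  const spans = shuffle(placed, random).map(({ word, row, col }) =>
    `<span style="position:absolute;left:${col}ch;top:${row * lineHeightEm}em">` +
    `${escapeHtml(word)}</span>`);
  return `<div style="position:relative;font-family:monospace;` +
    `height:${rows * lineHeightEm}em">${spans.join('')}</div>`;
}
```

Even in this toy form the weakness is visible: a scraper that reads the left/top values can sort the spans back into reading order, and copy-paste and screen readers would see the jumbled order too.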

I was also thinking of combining that with a CAPTCHA-like thing to muss up the text in subtle ways so as to hinder screen scrapers that could simply look at snapshots and discern letters or whatnot. But that's probably overthinking it.

Hmm. Has anyone yet devised any good methods for doing something like this?


Solution

  • I've seen a TV guide site decrypt its listings using JavaScript on the client side. It wouldn't stop a determined scraper, but it would stop most casual scripting.

    All the textual TV entries look something like ps10825('4VUknMERbnt0OAP3klgpmjs....abd26'), where ps10825 is simply a function that calls their decrypt function with a key of ps10825. Obviously the key is generated anew each time.

    In this case I think it's quite adequate to stop 99% of people using Greasemonkey or even wget scripts from downloading their TV guide without seeing all of their adverts.
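I don't know what cipher that site actually uses, so the following is only a minimal sketch of the general pattern, with a trivial XOR-and-hex scheme standing in for their decrypt function. The names obfuscate, deobfuscate, randomKey, and renderEntry are all hypothetical. The server picks a fresh key, scrambles each entry with it, and emits a call to a function named after the key.

```javascript
// Hypothetical sketch: XOR obfuscation stands in for the site's real
// (unknown) decrypt routine. This is obfuscation, not real encryption.

// Server side: XOR each UTF-16 code unit with the key, then hex-encode.
function obfuscate(text, key) {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i) ^ key.charCodeAt(i % key.length);
    out += unit.toString(16).padStart(4, '0');
  }
  return out;
}

// Client side: reverse the hex encoding and the XOR.
function deobfuscate(hex, key) {
  let out = '';
  for (let i = 0; i < hex.length; i += 4) {
    const unit = parseInt(hex.slice(i, i + 4), 16);
    out += String.fromCharCode(unit ^ key.charCodeAt((i / 4) % key.length));
  }
  return out;
}

// Generate a key that doubles as a function name, e.g. "ps04217".
// `random` is injectable so the sketch can be tested deterministically.
function randomKey(random = Math.random) {
  return 'ps' + Math.floor(random() * 100000).toString().padStart(5, '0');
}

// What the page source would contain in place of the plain text entry.
function renderEntry(text, key) {
  return `${key}('${obfuscate(text, key)}')`;
}

// The page would also define the per-key wrapper, along the lines of:
//   function ps04217(s) { return deobfuscate(s, 'ps04217'); }
const key = randomKey();
const entry = renderEntry('News at Ten', key);
console.log(entry);
console.log(deobfuscate(obfuscate('News at Ten', key), key));
```

Because the key and the decoder both ship with the page, anyone who runs the page's JavaScript gets the text back. The point is only that a plain wget or regex over the HTML no longer finds it.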