Search code examples
securityscreen-scraping

how can I protect scraping of certain data on my web pages?


I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to have images generated in the place of those numerbers, but if the image is not warped as with captcha, wont scripts be able to decipher the number anyway? Also, how much of a performance hit would loading images be vs text?


Solution

  • It's not possible.

    • You use javascript and encrypt the page, using document.write() calls after decrypting. I either scrape from the browser's display or feed the page through a JS engine to get the output.
    • You use Flash. I can poke into the flash file and get the values. You encrypt them in the flash and I can just run it then grab the output from the interpreter's display as a sequence of images.
    • You use images and I can just feed them through an OCR.

    You're in an arms race. What you need to do is make your information so useful and your pages so easy to use that you become the authority source. It's also handy to change your output formats regularly to keep up, but screen scrapers can handle that unless you make fairly radical changes. Radical changes drive users away because the page is continually unfamiliar to them.

    Your image solution wont' help much, and images are far less efficient. A number is usually only a few bytes long in HTML encoding. Images start at a few hundred bytes and expand to a 1k or more depending on how large you want. Images also will not render in the font the user has selected for their browser window, and are useless to people who use assisted computing devices (visually impaired people).