Tags: security, iis, virtual-directory, web-crawler

Preventing rogue spiders from indexing a directory


We have a secure website (developed in .NET 2.0/C#, running on Windows Server and IIS 5) to which members have to log in before they can view PDF files stored in a virtual directory. To keep spiders out of this website, we have a robots.txt that disallows all user agents. However, this will NOT stop rogue spiders from indexing the PDF files, since they simply disregard the robots.txt directives. Because the documents are meant to be secure, I do not want ANY spider getting into this virtual directory, not even the well-behaved ones.
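
(For reference, the disallow-all robots.txt we use is just the standard two lines below; well-behaved crawlers honour it, rogue ones ignore it.)

    User-agent: *
    Disallow: /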

I've read a few articles on the web and am wondering how programmers (rather than webmasters) have solved this problem in their applications, since it seems like a very common one. There are many options out there, but I am looking for something simple and elegant.

Here are some options I have seen that seem weak, listed with their cons:

  1. Create a honeypot/tarpit that lets rogue spiders in and then blacklists their IP addresses. Cons: this can also block valid users coming from the same IP, and the list has to be maintained manually (or members need some way to remove themselves from it). We don't have a fixed range of IPs that valid members will use, since the website is on the public internet.

  2. Request header analysis. Cons: rogue spiders use real agent names, so this is pointless.

  3. Meta robots tag (shown below for reference). Cons: only obeyed by Google and other well-behaved spiders.
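
For reference, the meta robots tag from option 3 looks like the line below; note that it can only be placed in HTML pages, not in the PDF files themselves, and it only matters to crawlers that choose to honour it.

    <meta name="robots" content="noindex, nofollow">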

There was also some talk about using .htaccess, which is supposed to be good, but that only works with Apache, not IIS.

Any suggestions are very much appreciated.

EDIT: As 9000 pointed out below, rogue spiders should not be able to get into a page that requires a login. I guess the question is really "how to prevent someone who knows the link from requesting the PDF file without logging into the website".


Solution

  • Here is what I did (expanding on Leigh's code).

    1. Created an HTTP handler for PDF files, added a web.config to the secure directory, and configured the handler to serve PDFs (see the configuration and handler sketch after this list).

    2. In the handler, I check whether the user is logged in by looking at a session variable set by the application.

    3. If the session variable is present, I create a FileInfo object and write the file to the response. Note: don't call 'context.Response.End()'; also, the 'Content-Disposition' header is obsolete.

    So now, whenever there is a request for a PDF in the secure directory, the HTTP handler gets the request and checks whether the user is logged in. If not, it displays an error message; otherwise it serves the file.

    I'm not sure whether there is a performance hit, since I am creating FileInfo objects and streaming the content rather than letting IIS serve the file that already exists on disk. The catch is that you can't Server.Transfer or Response.Redirect to the *.pdf file, since that request would hit the handler again and create an infinite loop, and the response would never get back to the user.
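
    For anyone wanting a concrete starting point, here is a rough sketch of this approach under .NET 2.0. The handler class name (SecurePdfHandler) and the session key ("LoggedIn") are placeholders; substitute whatever your application actually uses. Also remember that on IIS 5/6 the .pdf extension must be mapped to aspnet_isapi.dll in the IIS script mappings, otherwise requests for PDFs never reach ASP.NET at all.

        <!-- web.config in the secure virtual directory: route *.pdf through the handler.
             Depending on how the handler is compiled, the type attribute may also need
             an assembly name (e.g. "SecurePdfHandler, MyAssembly"). -->
        <configuration>
          <system.web>
            <httpHandlers>
              <add verb="*" path="*.pdf" type="SecurePdfHandler" />
            </httpHandlers>
          </system.web>
        </configuration>

        // SecurePdfHandler.cs (e.g. in App_Code) - names are hypothetical
        using System.IO;
        using System.Web;
        using System.Web.SessionState;

        public class SecurePdfHandler : IHttpHandler, IRequiresSessionState
        {
            public bool IsReusable
            {
                get { return false; }
            }

            public void ProcessRequest(HttpContext context)
            {
                // Only serve the PDF if the application has marked this session as logged in.
                if (context.Session == null || context.Session["LoggedIn"] == null)
                {
                    context.Response.StatusCode = 403;
                    context.Response.ContentType = "text/plain";
                    context.Response.Write("You must be logged in to view this document.");
                    return;
                }

                FileInfo file = new FileInfo(context.Request.PhysicalPath);
                if (!file.Exists)
                {
                    context.Response.StatusCode = 404;
                    return;
                }

                context.Response.ContentType = "application/pdf";
                context.Response.AddHeader("Content-Length", file.Length.ToString());
                context.Response.WriteFile(file.FullName);
                // As noted above, no Response.End() here.
            }
        }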