DotNetNuke robots.txt not being honored by Google bots


I have a multi-portal DotNetNuke install:
domain1.com
domain2.com
domain3.com etc

The server has 32 GB of RAM and 8 cores.

I have a single robots.txt file. When Google starts crawling, the CPU spikes to 100% for hours, with multiple Google IP addresses hitting the site. According to IIS, the URL they're trying to crawl is /lmm-product-service/elmah.axd/detail?id=af51e96f-d0cd-4598-90ad-ebe980947fa6, with a new ID each time this starts. The URL is the same for all current instances of the Google bot but changes when the crawling starts again.

That URL is not valid. When I try to go to it in a browser, I get a 404 Not Found error.

I have tried adding Disallow: /lmm-product-service/ to my robots.txt, to no avail:

    User-agent: Googlebot
    Disallow: /*/ctl/       # Googlebot permits *
    Disallow: /admin/
    Disallow: /lmm-product-service/
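
Since Googlebot understands wildcards in Disallow values (that's what the /*/ctl/ line above relies on), I suppose a broader pattern matching elmah.axd anywhere in the path would be another thing to try, though I haven't tested whether it makes any difference:

    User-agent: Googlebot
    Disallow: /*/ctl/              # Googlebot permits *
    Disallow: /admin/
    Disallow: /lmm-product-service/
    Disallow: /*elmah.axd          # untested: block the ELMAH handler anywhere in the path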

It's actually not only Google doing this; Ahrefs does it too, but I've blocked them at the firewall.

Any suggestions?


Solution

  • OK, keeping my fingers crossed. I took a different tack and simply added a URL Rewrite rule:

        <rule name="KillElmahRequests" enabled="true" stopProcessing="true">
            <match url=".*elmah.*" />
            <action type="AbortRequest" />
        </rule>
    

    It has been almost 90 minutes now with no issues. I still don't know why the bots are trying to crawl a URL that doesn't exist, or why, since it doesn't exist, it was eating up the w3wp.exe process, but this seems to be working.
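
    In case it helps anyone else: that rule goes in the site's web.config under the IIS URL Rewrite module's section. A minimal sketch of the surrounding structure (assuming the URL Rewrite module is installed):

        <configuration>
          <system.webServer>
            <rewrite>
              <rules>
                <!-- Drop any request whose path contains "elmah" before it reaches the DNN pipeline -->
                <rule name="KillElmahRequests" enabled="true" stopProcessing="true">
                  <match url=".*elmah.*" />
                  <action type="AbortRequest" />
                </rule>
              </rules>
            </rewrite>
          </system.webServer>
        </configuration>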