wordpress seo robots.txt google-crawlers

Disallow URL with specific querystring from crawl using robots.txt


My client has an ASP.NET MVC web application that also has a WordPress blog in a subfolder.

https://www.example.com/
https://www.example.com/wordpress

The WordPress site is loaded with some social sharing links that I do not want crawlers to index. For example:

https://www.example.com/wordpress/some-post/?share=pinterest

First thing, should there be a robots.txt in the / folder and also one in the /wordpress folder? Or just a single one in the / folder? I've tried both without any success.

In my robots.txt file I've included the following:

User-agent: Googlebot
Disallow: ?share=pinterest$

I've also tried several variations like:

Disallow: /wordpress/*/?share=pinterest

No matter what rule I have in robots.txt, I'm not able to get crawlers to stop trying to index these social sharing links. The plugin that creates these sharing links also marks them "nofollow noindex noreferrer", but since they are all internal links, this causes issues by blocking internal "link juice".

How do I write a rule that stops crawlers from indexing any link inside this site that ends with ?share=pinterest?

Should both sites have a robots.txt or only one in the main/root folder?


Solution

  • robots.txt belongs only at the root of the domain. https://www.example.com/robots.txt is the correct URL for your robots.txt file. Any robots.txt file in a subdirectory will be ignored.
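
To make the "root only" point concrete, here is a minimal sketch that derives the robots.txt location a crawler would consult for any page on the site. The helper name `robotsTxtUrlFor` is hypothetical and only serves this illustration.

```javascript
// Hypothetical helper: given any page URL, return the single robots.txt URL
// that applies to it. The path and query of the page are discarded, so a
// file at /wordpress/robots.txt is never the one that gets consulted.
function robotsTxtUrlFor(pageUrl) {
  return new URL("/robots.txt", pageUrl).href;
}

console.log(
  robotsTxtUrlFor("https://www.example.com/wordpress/some-post/?share=pinterest")
);
// prints "https://www.example.com/robots.txt"
```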

    By default, robots.txt rules are all "starts with" rules. Only a few major bots such as Googlebot support wildcards in Disallow: rules. If you use wildcards, the rules will be obeyed by the major search engines but ignored by most less sophisticated bots.

    Using nofollow on those links isn't really going to affect your internal link juice. Those links are all going to be external redirects that will either pass PageRank out of your site, or if you block that PageRank somehow, it will evaporate. Neither external linking nor PageRank evaporation hurts the SEO of the rest of your site, so it doesn't really matter from an SEO perspective what you do. You can allow those links to be crawled, use nofollow on those links, or disallow those links in robots.txt. It won't change how the rest of your site is ranked.

    robots.txt also has the disadvantage that search engines occasionally index disallowed pages. robots.txt blocks crawling, but it doesn't always prevent indexing. If any of those URLs get external links, Google may index the URL using the anchor text of the links that point to it.

    If you really want to hide the social sharing from search engine bots, you should have the functionality handled with onclick events. Something like:

    <a onclick="pinterestShare()">Share on Pinterest</a>

    Where pinterestShare is a JavaScript function that uses location.href to set the URL of the page to the Pinterest share URL for the current URL.
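
A minimal sketch of the click handler described above, spelled `pinterestShare` here. The Pinterest endpoint and its `url` query parameter are assumptions that should be checked against Pinterest's current documentation, and `buildPinterestShareUrl` is a hypothetical helper added so the URL construction can be read on its own.

```javascript
// Assumed Pinterest share endpoint; verify before relying on it.
const PINTEREST_SHARE_ENDPOINT = "https://www.pinterest.com/pin/create/button/";

// Hypothetical helper: build the share URL for a given page URL.
// searchParams.set() percent-encodes the page URL for us.
function buildPinterestShareUrl(pageUrl) {
  const shareUrl = new URL(PINTEREST_SHARE_ENDPOINT);
  shareUrl.searchParams.set("url", pageUrl);
  return shareUrl.href;
}

// Click handler: navigate the browser to the share URL for the current page.
// It only touches `window` when called, so it must run in a browser.
function pinterestShare() {
  window.location.href = buildPinterestShareUrl(window.location.href);
}
```

Because the share URL only exists inside the script, the page markup contains no ?share=pinterest link for a crawler to follow.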

    To directly answer your question about robots.txt, this rule is correct:

    User-agent: *
    Disallow: /wordpress/*/?share=pinterest
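
To see why this rule matches while the `?share=pinterest$` attempt from the question does not, here is a rough sketch of wildcard matching as the major search engines describe it: a rule is compared against the URL path plus query string from its first character, `*` matches any run of characters, and a trailing `$` anchors the end. The function names are hypothetical, and this is a simplification rather than any crawler's actual implementation.

```javascript
// Translate a robots.txt path pattern into a RegExp (simplified sketch).
function robotsPatternToRegExp(pattern) {
  const anchoredAtEnd = pattern.endsWith("$");
  const body = anchoredAtEnd ? pattern.slice(0, -1) : pattern;
  const source = body
    .split("*")
    // Escape regex metacharacters in the literal pieces between wildcards.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  // "^" makes this a "starts with" rule; "$" is added only when requested.
  return new RegExp("^" + source + (anchoredAtEnd ? "$" : ""));
}

// Check a full URL against one Disallow pattern.
function isDisallowed(pattern, url) {
  const { pathname, search } = new URL(url);
  return robotsPatternToRegExp(pattern).test(pathname + search);
}

const shareUrl = "https://www.example.com/wordpress/some-post/?share=pinterest";
console.log(isDisallowed("/wordpress/*/?share=pinterest", shareUrl)); // true
console.log(isDisallowed("?share=pinterest$", shareUrl)); // false
```

The second pattern fails in this sketch because the path being tested always begins with `/`, so a rule that begins with `?` can never match from the start.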
    

    You can use Google's robots.txt testing tool to verify that it blocks your URL.

    You may have to wait up to 24 hours after making robots.txt changes before bots start obeying the new rules. Bots often cache your old robots.txt for a day.

    You may have to wait weeks for new results to show in your webmaster tools and search console accounts. Search engines won't report new results until they get around to re-crawling pages, realize the requests are blocked, and that information makes it back to their webmaster information portals.