Tags: robots.txt, google-custom-search

Stop google from crawling specific link on each page


I made a sort of silly error in judgement when I first started developing my site using LESS CSS. With LESS, you can see real-time updates if you include #!watch at the end of the URL. So, being a properly lazy developer, I made a button on my dev page that only I knew about that would append #!watch to the current URL.

However, Google is treating that href as a legitimate link, and now all my pages are being indexed twice - once for the "normal" page, and once with the #!watch appended to the URL.

My question is: how can I remove the #!watch URLs from Google's index? Would a robots.txt line work for that? It wouldn't be such a big problem on its own, but I'm also using Google Custom Search internally, so when a user searches within my site, I'm serving duplicate results for the same content.


Solution

  • What I'm going to do is set up a sitemap.xml file with each of those offending links set to expire. I wrote a short Python script to iterate over each line (some 18,000 links) and spit out the formatted XML. It looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
            <loc>http://oq.totaleclips.com/mpa/The_Rise_of_the_Guardians_The_Video_Game_(Game)#!watch</loc>      
            <expires>2012-10-08</expires>
        </url>
       ....... (many more url entries)
    </urlset>
    

    Note the <expires> tag, which is read by Google (if not by other search engines) as a cut-off date for indexing. Apparently the URLs will still show up for 30-60 days, and then will stop being returned as search results.
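
    The generator script itself wasn't posted, but it could be sketched roughly like this. This is an assumption about its shape, not the original code; the function name and default date are made up, and the date matches the example above:

    ```python
    import xml.sax.saxutils as saxutils

    def build_sitemap(urls, expires="2012-10-08"):
        """Return sitemap XML marking each URL as expired on the given date."""
        lines = ['<?xml version="1.0" encoding="UTF-8"?>',
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for url in urls:
            lines.append("    <url>")
            # Escape &, <, > in case any URL contains them
            lines.append("        <loc>%s</loc>" % saxutils.escape(url))
            lines.append("        <expires>%s</expires>" % expires)
            lines.append("    </url>")
        lines.append("</urlset>")
        return "\n".join(lines)
    ```

    To cover all ~18,000 links, you'd read them from a text file (one URL per line) and write the returned string out as sitemap.xml.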