I made a sort of silly error in judgement when I first started developing my site using LESS CSS. With LESS, you can see real-time updates if you include #!watch at the end of the URL. So, being a properly lazy developer, I made a button on my dev pages that only I knew about, which appended #!watch to the current URL.
However, Google is treating that href as a legitimate link, and now all my pages are being indexed twice - once for the "normal" page, and once with the #!watch appended to the URL.
My question is: how can I remove the #!watch URLs from Google's index? Would a robots.txt line work for that? It wouldn't be such a big problem on its own, but I'm also using Google Custom Search internally, so when users search within my site, they get duplicate results for the same content.
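As an aside on the robots.txt idea: a robots.txt rule can only match the URL the crawler actually requests, and the fragment after # is never sent to the server, so a plain Disallow can't target #!watch directly. A rule like the sketch below would only help if Googlebot is rewriting the #! URLs into its ?_escaped_fragment_= form (part of its AJAX crawling scheme); this pattern is illustrative, not something I've verified:

```text
# Hypothetical robots.txt sketch. Assumes Googlebot requests
# page?_escaped_fragment_=watch in place of page#!watch.
User-agent: Googlebot
Disallow: /*?_escaped_fragment_=watch
```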
What I'm going to do is set up a sitemap.xml doc with each of the offending links set to expire. I wrote a short Python script to iterate over each line (some 18,000 links) and spit out the formatted XML. It looks like:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://oq.totaleclips.com/mpa/The_Rise_of_the_Guardians_The_Video_Game_(Game)#!watch</loc>
<expires>2012-10-08</expires>
</url>
....... (many more url entries)
</urlset>
Note the <expires>
tag, which is read by Google (if not by other search engines) as a cut-off date for indexing. The pages will apparently still show up for 30-60 days, and then will stop being returned as search results.
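The short script I mentioned looks something like this, a minimal sketch rather than my exact code: it assumes the 18,000 offending links live in a plain text file, one URL per line, and that #!watch still needs to be appended (the input filename and function name here are hypothetical):

```python
from xml.sax.saxutils import escape

EXPIRES = "2012-10-08"
HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')

def make_sitemap(urls, expires=EXPIRES):
    """Build sitemap XML with an <expires> entry for each #!watch URL."""
    entries = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines in the input file
        entries.append(
            "<url>\n"
            # escape() handles any &, <, > characters in the URL
            f"<loc>{escape(url)}#!watch</loc>\n"
            f"<expires>{expires}</expires>\n"
            "</url>"
        )
    return "\n".join([HEADER, *entries, "</urlset>"])

# Usage (input filename is hypothetical):
# with open("offending_urls.txt") as f:
#     print(make_sitemap(f))
```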