Tags: single-page-application, sitemap, google-search, robots.txt, google-search-console

Google 'Sitemap contains urls which are blocked by robots.txt' warning


The problem is that a whitelist-style robots.txt, which relies on a blanket Disallow: /, doesn't work as expected with Google.

Google has issues with this restrictive set of robots.txt rules:

User-agent: *
Host: sitename
Allow: /$
Allow: /sitemap.xml
Allow: /static/
Allow: /articles/
Disallow: /
Disallow: /static/*.js$

Here sitemap.xml contains / and numerous /articles/... URLs:

<url><loc>http://sitename/</loc><changefreq>weekly</changefreq></url>
<url><loc>http://sitename/articles/some-article</loc><changefreq>weekly</changefreq></url>
<url><loc>http://sitename/articles/...</loc><changefreq>weekly</changefreq></url>
...
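
As a sanity check, here is a rough Python sketch of the longest-match precedence described in Google's robots.txt documentation, applied to the rules above (a simplified approximation with made-up helper names, not Google's actual matcher); its verdicts agree with what the tester reports below:

import re

# (directive, path) pairs taken from the robots.txt above
RULES = [
    ("allow", "/$"),
    ("allow", "/sitemap.xml"),
    ("allow", "/static/"),
    ("allow", "/articles/"),
    ("disallow", "/"),
    ("disallow", "/static/*.js$"),
]

def rule_matches(rule_path, url_path):
    # '*' matches any sequence of characters, a trailing '$' anchors the end of the URL
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, url_path) is not None

def verdict(url_path):
    # Most specific (longest) matching rule wins; on a tie, allow is used
    matches = [(len(path), directive == "allow")
               for directive, path in RULES if rule_matches(path, url_path)]
    if not matches:
        return "allow"
    return "allow" if max(matches)[1] else "disallow"

for path in ["/", "/articles/some-article", "/some-other-route"]:
    print(path, "->", verdict(path))

# /                      -> allow     (Allow: /$ is more specific than Disallow: /)
# /articles/some-article -> allow     (Allow: /articles/ is more specific than Disallow: /)
# /some-other-route      -> disallow  (only Disallow: / matches)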

The Crawl / robots.txt Tester in Google Search Console interprets it correctly and shows these URLs as allowed ('Fetch as Google' works as well):

sitename/

sitename/articles/some-article

However, the Crawl / Sitemaps report shows that sitemap.xml has issues with all the /articles/... URLs; the warning is:

Sitemap contains urls which are blocked by robots.txt

Thus, only / is indexed (it was even removed from the index at some point, although Google never complained about it in the sitemap report).

The reason behind this setup is that Google is unable to render SPA routes properly, so some SPA routes (/ and /articles/...) were prerendered as fragments and allowed for crawling (other routes aren't prerendered yet, and it isn't desirable to make them available for crawling at the moment).

I temporarily replaced Disallow: / with a blacklist of all known routes that have no fragments, and the problem disappeared:

User-agent: *
Host: sitename
Allow: /$
Allow: /sitemap.xml
Allow: /static/
Allow: /articles/
Disallow: /blacklisted-route1
Disallow: /blacklisted-route2
...
Disallow: /static/*.js$

What is the problem with the former approach? Why does Google behave like that?

The robots.txt rules are quite unambiguous, and Google's robots.txt Tester only confirms that.


Solution

  • When you allow /$ and disallow /, disallow wins (see 'Order of precedence for group-member records' at https://developers.google.com/search/reference/robots_txt).

    Forget my earlier comment about the last rule prevailing over the first rule; it does not apply in your case.

    To remove fragments, use a canonical tag. If you don't want Google to crawl your pages, set a nofollow.
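
    A minimal sketch of both suggestions in HTML (the URL is only a placeholder; apply whichever fits a given page):

    <!-- on a prerendered fragment, point at the canonical URL of the page -->
    <link rel="canonical" href="http://sitename/articles/some-article">

    <!-- on pages whose links you don't want Google to crawl -->
    <meta name="robots" content="nofollow">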