With regard to Google's AJAX crawling spec, if the server returns one thing (namely, a JavaScript-heavy file) for a #! URL and something else (namely, an "HTML snapshot" of the page) to Googlebot when the #! is replaced with ?_escaped_fragment_=, that feels like cloaking to me. After all, how is Googlebot sure that the server is returning good-faith equivalents for both the #! and ?_escaped_fragment_= URLs? Yet this is exactly what the AJAX crawling spec tells webmasters to do. Am I missing something? How is Googlebot sure that the server is returning the same content in both cases?
The crawler does not know. But then it never knows, even for sites that return plain ol' HTML - it is extremely easy to write code that cloaks a site based on the HTTP headers crawlers send or on known crawler IP addresses.
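To show how easy, here is a toy sketch of header-based cloaking (using Flask purely as an example framework; any server can do the same, and an IP-based variant would just key off the client address instead):

```python
from flask import Flask, request

app = Flask(__name__)

# Tokens commonly found in crawler User-Agent strings.
CRAWLER_TOKENS = ("googlebot", "bingbot")

@app.route("/page")
def page():
    ua = request.headers.get("User-Agent", "").lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        # Crawlers get one document...
        return "<html><body>Content shown only to crawlers</body></html>"
    # ...everyone else gets another. That is cloaking in a dozen lines.
    return "<html><body>Content shown to real visitors</body></html>"
```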
See this related question: How does Google Know you are Cloaking?
Most of it is conjecture, but it seems likely there are various checks in place, ranging from spoofing normal browser headers to an actual person looking at the page.
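The header-spoofing kind of check could be as simple as fetching the same URL under two different User-Agent strings and comparing the responses. A hedged sketch (the user-agent strings are illustrative, and a real check would need a fuzzier comparison than a byte-for-byte hash, since ads and timestamps legitimately change between requests):

```python
import hashlib
import requests

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

def body_digest(url, user_agent):
    """Fetch the page while claiming to be the given user agent."""
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    return hashlib.sha256(resp.content).hexdigest()

def looks_cloaked(url):
    # A mismatch is only a *signal*, not proof, for the reasons above.
    return body_digest(url, GOOGLEBOT_UA) != body_digest(url, BROWSER_UA)
```

Note this only catches User-Agent cloaking; IP-based cloaking is exactly why such a check would have to run from outside Google's published crawl ranges.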
Continuing the conjecture, it certainly wouldn't be beyond the capabilities of Google's engineers to write a crawler that actually retrieves what the user sees - after all, they have their own browser that does just that. Rendering every page that way would be prohibitively CPU-expensive, but it probably makes sense for the occasional spot-check.
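For the AJAX crawling case specifically, such a spot-check might look roughly like this: render the #! URL in a headless browser, fetch the ?_escaped_fragment_= snapshot directly, and compare the visible words. A sketch using Selenium and requests - entirely my own illustration of the idea, not anything Google has documented:

```python
import re
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def rendered_words(hashbang_url):
    """Words a real user would see: load the #! URL in a headless browser."""
    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(hashbang_url)
        return set(driver.find_element(By.TAG_NAME, "body").text.lower().split())
    finally:
        driver.quit()

def snapshot_words(snapshot_url):
    """Words the server claims in its _escaped_fragment_= snapshot."""
    html = requests.get(snapshot_url, timeout=10).text
    return set(re.sub(r"<[^>]+>", " ", html).lower().split())  # crude tag strip

def overlap(a, b):
    """Jaccard similarity of the two word sets; low overlap smells like cloaking."""
    return len(a & b) / max(len(a | b), 1)
```

Run occasionally against a sample of #! URLs, a low overlap score between the two word sets would be a strong cloaking signal, without paying the rendering cost on every crawl.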