Wicket stateful pages cause creeping overload from googlebot


We have been using Wicket for several projects since version 1.5. We recently upgraded to Wicket 8 (from 6 and 7) and have run into issues with Google's crawler. The problem may have started while we were still on Wicket 6; I'm not sure, since it crept up on us gradually.

The problem is that Wicket adds a page id (version) to the URL of stateful pages, and all links on that page (Ajax links) use that same id.

We have a single-page application with lots of Ajax links, and Googlebot traffic is increasing day by day. In the access logs I see Google requesting URLs with page ids upwards of 4,500,000 (this was just a random sample, so the real maximum may be higher), for example ?4529280-1.0-xxxx. Multiply that by around 100 links per page (if not more), and you see the problem. We also see Google requesting links where the page id is still 0 but the render count is huge (?0-4534543.0-xxxx).
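To measure how far the crawler has drifted, the page id and render count can be pulled out of the logged URLs. The following is only a minimal sketch: the class name `WicketPageInfo`, the `/home` path, and the regular expression are assumptions based on the two sample URLs above, not part of Wicket's API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Wicket: pulls the page id and render count
// out of a logged request URL.
final class WicketPageInfo {
    // Assumed shape, inferred from the samples above: ?<pageId>-<renderCount>.<rest>
    // Digit runs are capped at 18 so Long.parseLong cannot overflow.
    private static final Pattern PAGE_INFO =
            Pattern.compile("\\?(\\d{1,18})-(\\d{1,18})\\.");

    final long pageId;
    final long renderCount;

    private WicketPageInfo(long pageId, long renderCount) {
        this.pageId = pageId;
        this.renderCount = renderCount;
    }

    // Returns empty when the URL carries no page info in the assumed shape.
    static Optional<WicketPageInfo> parse(String url) {
        Matcher m = PAGE_INFO.matcher(url);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new WicketPageInfo(
                Long.parseLong(m.group(1)), Long.parseLong(m.group(2))));
    }

    public static void main(String[] args) {
        // "/home" is a made-up path; the query strings are the samples from the logs.
        String[] samples = {"/home?4529280-1.0-xxxx", "/home?0-4534543.0-xxxx", "/home"};
        for (String url : samples) {
            String summary = parse(url)
                    .map(p -> "pageId=" + p.pageId + " renderCount=" + p.renderCount)
                    .orElse("no page info");
            System.out.println(url + " -> " + summary);
        }
    }
}
```

Running this over the access log and tracking the maximum of each field per day would show whether the page id, the render count, or both are growing.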

Why is this happening? I don't think it used to happen, but if that's the case, I don't know whether Wicket changed something or Google did.

(Using setVersioned(false) does not help, since Wicket still adds and increments the pageId, as far as I can see)

We have had luck with changing another application to use stateless pages, but I'm not sure we can do that with this one, and it's a fair bit of work anyway...


Solution

  • Wicket has not changed the way it encodes the page id in the URL since 1.5.0, so the behavior must have been the same in all your applications.

    You can tell bots to not index or to not follow links in a page with meta elements like:

    <meta name="robots" content="noindex, nofollow">
    <meta name="googlebot" content="noindex, nofollow">
    

    Or you can use robots.txt to keep crawlers away from those URLs.

    You can also use rel="nofollow" for specific links in your page:

    <a href="https://www.example.com" rel="nofollow">example</a>
    

    And yes, generally it is recommended to use stateless pages for public pages. Stateful ones should be used for pages which are behind some kind of authentication/authorization.
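For the robots.txt route mentioned above, a rule along these lines might work. It is only a sketch: it assumes the crawler supports the `*` wildcard (Googlebot does), and the pattern is a guess based on the `?4529280-1.0-xxxx` shape of the URLs in the question. It can also match unrelated query strings that contain a dash, a dot and another dash in that order, so test it against your own URLs before deploying.

```text
User-agent: *
# Assumed pattern: "?" followed later by "-", ".", "-" (the listener URLs from the logs)
Disallow: /*?*-*.*-
```

A Disallow rule stops the crawler from requesting the matching URLs at all, which is what reduces the load. The meta elements, by contrast, are only seen after the page has already been fetched.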