googlebot fail2ban

fail2ban forces me to ban Google because of /forward requests in my log


In my Apache log, I have a lot of entries like this:

<IP ADDRESS> - - <DATE> "GET /forward?path=http://very_bad_link_not_for_children" <NUM1> <NUM2> "-" <String>

<NUM1>: 302 or 404

<NUM2>: 5XX, 6XX or 11XX

<String>:

"Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)"

"Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"

"Mozilla/5.0 (compatible; Googlebot/2.1; +...a link)"

"Mozilla/5.0 (compatible; Exabot/3.0; +...a link)"

etc...

I have made a jail for fail2ban with this regex:

failregex = ^<HOST> .*"GET .*/forward\?path=
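To see what this filter actually matches, it can be approximated in a few lines of Python. This is only a sketch: fail2ban expands `<HOST>` into its own, more elaborate pattern, so the simplified IPv4 group below is an assumption, and the sample log lines (documentation-range addresses, a made-up `ExampleBot` user agent) are invented for illustration.

```python
import re

# Simplified stand-in for fail2ban's <HOST> tag (assumption: the real
# expansion is more elaborate and also covers hostnames and IPv6).
HOST = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"
FAILREGEX = re.compile(r'^<HOST> .*"GET .*/forward\?path='.replace("<HOST>", HOST))

def banned_host(line):
    """Return the address the filter would ban for this line, or None."""
    match = FAILREGEX.search(line)
    return match.group("host") if match else None

# Hypothetical log lines in Apache combined format.
sample_lines = [
    '203.0.113.7 - - [01/Jan/2015:10:00:00 +0000] '
    '"GET /forward?path=http://example.invalid/ HTTP/1.1" 302 512 "-" "ExampleBot/1.0"',
    '198.51.100.9 - - [01/Jan/2015:10:00:05 +0000] '
    '"GET /index.html HTTP/1.1" 200 1043 "-" "ExampleBot/1.0"',
]

for line in sample_lines:
    print(banned_host(line))
```

The pattern never looks at the user agent, so any client that requests /forward?path=... is matched, whether it is a well-known crawler or not.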

Everything works fine, except that the banned IP addresses (see <IP ADDRESS> in the log) belong to Google and other very well-known companies.

I really don't understand this. Why should I ban Google and the other companies? And if I shouldn't, why should I accept all those inappropriate requests to my server?

I would like to clarify my questions, as they were poorly explained:

1-Why are Google IPs (and those of other known companies) making these kinds of "porn" requests?

2-Does "/forward?path=..." have any meaning? Is it an Apache feature?

3-How can I handle this problem without stopping the "good" bots from indexing my sites?

Thanks in advance for any help!


Solution

  • You can tell robots not to visit parts of your site in your robots.txt.

    Adding

    User-agent: *
    Disallow: /forward
    

    to your robots.txt will keep all well-behaved bots from visiting any page whose path begins with /forward. They will continue to visit and index other pages.
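One way to sanity-check such a rule before deploying it is Python's standard `urllib.robotparser`, which applies the same prefix matching. This is a minimal sketch: the `example.com` URLs and the `ExampleBot` agent name are placeholders, and real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rule from the answer, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /forward
"""

def allowed(url, agent="ExampleBot"):
    """Return True if a bot obeying ROBOTS_TXT may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

print(allowed("https://example.com/forward?path=http://example.invalid/"))
print(allowed("https://example.com/index.html"))
```

With the rule above, the first check should come back False and the second True.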

    If you want to allow /forward?path=something_nice but not /forward?path=very_bad_link, you can list the unwanted URLs individually:

    User-agent: *
    Disallow: /forward?path=a_specific_bad_link
    Disallow: /forward?path=another_bad_link
    

    Why are bots making these requests?

    This may be entirely innocent. Perhaps someone has mistakenly linked to your site, or perhaps the page used to exist and no longer does.

    This may be due to a link on your own site that points to this URL. Check for that.

    In the worst case, it might be people using you as an unwitting proxy. Make sure that the server does not serve anything when /forward is requested, and check the logs for anything else suspicious.
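A quick way to check what the server returns for /forward is to tally the status codes in the access log. The sketch below assumes the Apache combined log format; the regular expression and the sample lines are illustrative, not taken from the real log. A 404 means nothing is served, whereas a 302 is a redirect and would be worth investigating.

```python
import re
from collections import Counter

# Assumed Apache combined log format: ip, two fields, [date], "request", status.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def forward_status_counts(lines):
    """Count HTTP status codes for requests whose path is exactly /forward."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").split("?", 1)[0] == "/forward":
            counts[match.group("status")] += 1
    return counts

# Hypothetical log lines for demonstration.
sample_log = [
    '203.0.113.7 - - [01/Jan/2015:10:00:00 +0000] "GET /forward?path=http://example.invalid/a HTTP/1.1" 302 512 "-" "ExampleBot/1.0"',
    '203.0.113.8 - - [01/Jan/2015:10:00:02 +0000] "GET /forward?path=http://example.invalid/b HTTP/1.1" 404 610 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [01/Jan/2015:10:00:04 +0000] "GET /forward?path=http://example.invalid/c HTTP/1.1" 302 530 "-" "ExampleBot/1.0"',
    '198.51.100.9 - - [01/Jan/2015:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 1043 "-" "ExampleBot/1.0"',
]

print(forward_status_counts(sample_log))
```

On a real server, the same function could be fed the lines of the access log file instead of the sample list.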

    What if the requests continue?

    It may take a while for the requests to stop. Robots do not fetch your robots.txt before every request, so you will have to wait for them to refresh their copy.

    However, if the requests never stop, the bots are probably malicious and merely spoofing the Googlebot user-agent. robots.txt only provides instructions to robots: well-behaved bots honour them, but the file cannot force a malicious robot to stay away. In that case you need a solution like fail2ban.
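To tell a spoofed user-agent from the real crawler, Google's published advice is a reverse DNS lookup followed by a confirming forward lookup. The sketch below follows that idea under stated assumptions: the function name, the accepted suffixes (`googlebot.com` and `google.com`) and the injectable lookup parameters are illustrative choices, `socket.gethostbyname_ex` only handles IPv4, and the fake resolvers exist solely so the example runs without network access.

```python
import socket

# Assumed host name suffixes for genuine Google crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname = reverse_lookup(ip)[0].lower().rstrip(".")
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The host name must resolve back to the same address.
        return ip in forward_lookup(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False

# Fake resolvers (hypothetical data) so the demo does not touch the network.
PTR = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "192.0.2.66": "host.example.net",
    "192.0.2.99": "fake.googlebot.com",  # PTR record controlled by an attacker
}
A = {
    "crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
    "fake.googlebot.com": ["198.51.100.1"],
}

def fake_reverse(ip):
    return (PTR[ip], [], [ip])

def fake_forward(hostname):
    if hostname not in A:
        raise socket.gaierror("unknown host")
    return (hostname, [], A[hostname])

for ip in PTR:
    print(ip, is_verified_googlebot(ip, fake_reverse, fake_forward))
```

Addresses that pass such a check could be exempted from the ban, while clients that merely claim to be Googlebot stay subject to the fail2ban jail.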