In my apache log, I have a lot of stuff like this:
<IP ADDRESS> - - <DATE> "GET /forward?path=http://vary_bad_link_not_for_children" <NUM1> <NUM2> "-" <String>
<NUM1>: 302 or 404
<NUM2>: 5XX, 6XX or 11XX
<String>:
"Mozilla/5.0 (compatible; AhrefsBot/5.1; +http://ahrefs.com/robot/)"
"Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
"Mozilla/5.0 (compatible; Googlebot/2.1; +...a link)"
"Mozilla/5.0 (compatible; Exabot/3.0; +...a link)"
etc...
I have made a jail for fail2ban with this regex:
failregex = ^<HOST> .*"GET .*/forward\?path=
Everything is working fine, except that the IP addresses being banned (see <IP ADDRESS> in the log) belong to Google and other very well-known companies.
I really don't understand why this is happening. Should I really ban Google and these other companies, and if not, why should I accept all those inappropriate requests to my server?
I would like to clarify my questions, as they were poorly explained:
1. Why are Google's IPs (and those of other known companies) making these kinds of "porn" requests?
2. Does "/forward?path=..." have any special meaning? Is it an Apache feature?
3. How can I handle this problem without stopping the "good" bots from indexing my sites?
Thanks in advance for any help!
Adding
User-agent: *
Disallow: /forward
to your robots.txt will keep all well-behaved bots from requesting any page whose path begins with /forward. They will continue to visit and index your other pages.
If you want to allow /forward?path=something_nice but not /forward?path=very_bad_link, you can do that:
User-agent: *
Disallow: /forward?path=a_specific_bad_link
Disallow: /forward?path=another_bad_link
This may be entirely innocent. Perhaps someone has mistakenly linked to your site, perhaps the page used to exist and no longer does.
This may be due to a link on your own site that points to this URL. Check for that.
In the worst case, it might be people using you as an unwitting proxy. Make sure that the server does not serve anything when /forward is requested, and check the logs for anything else suspicious.
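One way to make sure nothing is served, assuming Apache 2.4, is to deny the path outright in your server or virtual-host configuration (the snippet below is a sketch, not taken from your setup):

```apache
# Refuse any request whose path starts with /forward (Apache 2.4 syntax)
<Location "/forward">
    Require all denied
</Location>
```

Apache will then answer every such request with 403 Forbidden without doing any further work.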
It may take a while for the requests to stop. Robots do not request your robots.txt every time, and you will have to wait for them to update.
However, if they don't eventually stop, it means they are malicious bots spoofing the Googlebot user agent. robots.txt only provides instructions to robots: well-behaved bots honour them, but nothing forces a malicious robot to stay away. For those you then need a solution like fail2ban.
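To tell a genuine Googlebot apart from a bot that merely spoofs its user agent, you can apply Google's documented double DNS check: reverse-resolve the IP, confirm the hostname is under googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch in Python (function names are my own, not from any existing tool):

```python
import socket

# Domains Google documents for its crawler hostnames
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(host: str) -> bool:
    """True if the hostname sits under one of Google's crawler domains."""
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def is_genuine_googlebot(ip: str) -> bool:
    """Double DNS check for verifying a claimed Googlebot IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)           # reverse (PTR) lookup
    except (socket.herror, socket.gaierror):
        return False
    if not hostname_is_google(host):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward (A) lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                            # must map back to the IP
```

If a client claiming to be Googlebot fails this check, you can ban it with a clear conscience; IPs that pass it should be whitelisted in fail2ban (e.g. via ignoreip or an ignorecommand) rather than banned.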