I have checked the logs and found that search engines visit a lot of bogus URLs on my website. Most of them probably date from before a lot of the links were changed, and even though I have set up 301 redirects, some links have been altered in very strange ways and aren't recognized by my .htaccess file.
All requests are handled by index.php. If a response can't be created due to a bad URL, a custom error page is presented instead. In simplified form, index.php looks like this:
try {
    // Build and send the response for the requested URL
    $Request = new Request();
    $Request->respond();
} catch (NoresponseException $e) {
    // No response could be created; show the custom error page
    $Request->presentErrorPage();
}
I just realized that this error page returns status 200, telling the bot that the page is valid even though it isn't.
Is it enough to add a 404 header in the catch block to tell the bots to stop visiting that page?
Like this:
header("HTTP/1.0 404 Not Found");
It looks OK when I test it, but I'm worried that search engine bots (and maybe user agents) will get confused.
You're getting there, and the idea is correct: you want to give them a 404. One small correction, though: if the client makes the request using HTTP/1.1 and you answer using HTTP/1.0, some clients will get confused.
The way around this is as follows:
header($_SERVER['SERVER_PROTOCOL']." 404 Not Found");
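Putting that together with your simplified index.php, the catch block would look roughly like this (a sketch, assuming presentErrorPage() sends no output before header() is called):

try {
    $Request = new Request();
    $Request->respond();
} catch (NoresponseException $e) {
    // Echo back whatever protocol the client used (HTTP/1.0 or HTTP/1.1)
    // and send the 404 status before the error page produces any output
    header($_SERVER['SERVER_PROTOCOL'] . " 404 Not Found");
    $Request->presentErrorPage();
}

Note that header() must be called before any output is sent; otherwise PHP will warn that the headers have already been sent and the bots will still see a 200.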