I am working on a PHP site that allows users to post a listing for their business related to the site's theme. Each listing includes a single link URL, some text, and an optional URL for an image file.
Example:
<img src="http://www.somesite.com" width="40" />
<a href="http://www.abcbusiness.com" target="new">ABC Business</a>
<p>
Some text about how great abc business is...
</p>
The HTML in the text is filtered using the class from htmlpurifier.org and the content is checked for bad words, so I feel pretty good about that part.
The image file URL is always placed inside an <img src="" /> tag with a fixed width and validated to be an actual HTTP URL, so that should be OK.
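A minimal sketch of that image URL validation, assuming PHP 7.1 or later. The function names `isHttpUrl` and `imageTag` are hypothetical and only illustrate the idea; this is not the site's actual code.

```php
<?php
// Hypothetical helper: accept only absolute http/https URLs that have a host.
function isHttpUrl(string $url): bool
{
    // FILTER_VALIDATE_URL checks general URL syntax but also allows
    // other schemes (for example ftp://), so the scheme is checked separately.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host = parse_url($url, PHP_URL_HOST);

    return in_array($scheme, ['http', 'https'], true) && !empty($host);
}

// Hypothetical helper: build the fixed-width <img> tag, escaping the URL
// so it cannot break out of the src attribute.
function imageTag(string $url, int $width = 40): string
{
    return sprintf(
        '<img src="%s" width="%d" />',
        htmlspecialchars($url, ENT_QUOTES, 'UTF-8'),
        $width
    );
}
```

This only confirms that the value looks like an HTTP URL; it says nothing about what the server at that address returns.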
The dangerous part is the link.
Question: How can I check, in code, that the link does not point to a spam, unsafe, or porn site?
I can check the response headers for a 404 and so on, but is there a quick and easy way to validate a site's content from a link?
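For reference, the header check mentioned above might look roughly like this. It is a sketch built on PHP's `get_headers()`; the names `finalStatusCode` and `linkResponds` are hypothetical. It only tells you whether the link answers at all, not what the page contains.

```php
<?php
// Hypothetical helper: return the last HTTP status code in a header list.
// When redirects are followed, get_headers() returns the headers of every
// response in the chain, so the last status line is the final one.
function finalStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: true if the URL ends in a 2xx or 3xx response.
// This performs a real network request.
function linkResponds(string $url): bool
{
    $headers = @get_headers($url);
    if ($headers === false) {
        return false; // DNS failure, connection refused, timeout, etc.
    }
    $code = finalStatusCode($headers);

    return $code !== null && $code >= 200 && $code < 400;
}
```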
EDIT:
I am using a CAPTCHA and do require registration before posting is allowed.
"Is there a quick and easy way to validate a site's content from a link?"
No. There is no global white/blacklist of URLs which you can use to somehow filter out "bad" sites, especially since your definition of a "bad" site is so unspecific.
Even if you could look at a URL and tell whether the page it points to has bad content, it's trivially easy to disguise a URL these days.
If you really need to prevent this, you should moderate your content. Any automated solution is going to be imperfect, and you're going to wind up moderating manually anyway.
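As a rough illustration of what moderation could look like in code, here is a sketch of a pending/approved workflow. The status values, field names, and function names are all hypothetical and not tied to any framework or to the asker's schema; a real site would store the status in the database.

```php
<?php
// Hypothetical status values for a listing.
const STATUS_PENDING = 'pending';
const STATUS_APPROVED = 'approved';
const STATUS_REJECTED = 'rejected';

// Every new listing starts out hidden until a human has looked at it.
function newListing(string $title, string $url, string $text): array
{
    return [
        'title' => $title,
        'url' => $url,
        'text' => $text,
        'status' => STATUS_PENDING,
    ];
}

// Record the moderator's decision on a listing.
function review(array $listing, bool $approve): array
{
    $listing['status'] = $approve ? STATUS_APPROVED : STATUS_REJECTED;
    return $listing;
}

// Only approved listings are ever shown to visitors.
function visibleListings(array $listings): array
{
    return array_values(array_filter($listings, function (array $l): bool {
        return $l['status'] === STATUS_APPROVED;
    }));
}
```

The point of the design is that a link nobody has reviewed is never displayed, so an automated check can be used to prioritize the queue without being trusted as the final word.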