Would it be a lot easier to write a simple regex for finding URLs, then have another script check whether that site sends back data or not? I've always wondered if this would be a faster and easier solution than taking years to develop the "perfect" URL-detecting regex, just to have it crushed a few days later.
If anyone can find speed tests for a basic page access/load, please post them here to help answer my question.
Also, how hard would it be on the server to constantly make requests such as this, say... 100 times an hour?
I am going to test this out with JavaScript, using /(http|www\.)\S+/gim as the regex and a 60-second connection timeout for the requested URL. I will do a simple "Title Grab" from the URL, then record how long each trial takes. I'll post the speeds once I get them all fancied up.
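For anyone who wants to try the first half of this, here is a minimal sketch of the extraction step using the regex above. The helper name findUrlCandidates and the sample text are made up for illustration. Note that \S+ runs until the next whitespace, so trailing punctuation such as a comma or closing parenthesis ends up in the match.

```javascript
// The pattern from the post: g = all matches, i = case-insensitive.
// (m only changes how ^ and $ behave, which this pattern doesn't use.)
const URL_CANDIDATE = /(http|www\.)\S+/gim;

// Hypothetical helper: return every URL-like substring found in the text.
function findUrlCandidates(text) {
  // With the g flag, match() returns the full matches, or null if there are none.
  return text.match(URL_CANDIDATE) || [];
}

const sample = "Docs at http://example.com/a?b=1 and www.example.org, that's all";
console.log(findUrlCandidates(sample));
// The second candidate keeps its trailing comma: "www.example.org,"
```

Anything this loose will produce false positives, which is the point of the idea: the follow-up request is what decides whether a candidate is worth keeping.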
This really isn't much of a question anymore, so if you find anything that really helps me with the idea above, you might receive the glorious Answer Checkmark for this "question".
I think the point @Kobi was making is that the validity of a URL is distinct from the presence of a resource at that URL. A valid URL may not point to a present resource. For example, the URL http://bclennox.com/there-is-no-page-at-this-address will return a 404, presumably failing your test even though it's a perfectly valid URL.
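To make the distinction concrete, here is a small sketch of a purely syntactic check using the standard URL constructor, which is available in browsers and in Node.js. The helper name isWellFormedUrl is just for illustration. It never touches the network, so it says nothing about whether a page actually exists.

```javascript
// Syntactic check only: does the string parse as an absolute URL?
function isWellFormedUrl(candidate) {
  try {
    new URL(candidate);
    return true;
  } catch (err) {
    // The URL constructor throws a TypeError when the input cannot be parsed.
    return false;
  }
}

// Well-formed, even though the server answers 404 for it.
console.log(isWellFormedUrl("http://bclennox.com/there-is-no-page-at-this-address")); // true
console.log(isWellFormedUrl("not a url")); // false
```

A bare www.example.com is also rejected here, because without a scheme it is not an absolute URL.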
At any rate, if you're primarily interested in the HTTP status returned for a given URL, you can just issue an HTTP HEAD request rather than a normal GET. HEAD returns a much smaller payload (only the headers), which should speed up your requests considerably.
Here's an example using curl:
$ curl -I http://bclennox.com
HTTP/1.1 200 OK
Date: Thu, 15 Mar 2012 03:14:59 GMT
Server: Apache
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.9, Enterprise Edition
ETag: "39cf7d1099a034de95dda297b18bfa2d"
X-UA-Compatible: IE=Edge,chrome=1
X-Rack-Cache: miss
X-Runtime: 0.139410
X-Request-Id: 50ce319e403ef4e6e468c2f4b9817691
Cache-Control: max-age=0, private, must-revalidate
Set-Cookie: _master_session=BAh7ByIQX2NzcmZfdG9rZW4iMWZhM0t1dTZiNjVWV1Q3YzlKVTZmdjRwK0FiWlpHUExVWXJnRlovd2R5aU09Ig9zZXNzaW9uX2lkIiU3YWEzZmNhYmYzYTQ2MDgwNTY5ZmU5MjhlNWU3ZDhmMA%3D%3D--c0f8c2bd6cccb1ff12f28da996dddbb50e448f1f; path=/; HttpOnly
Status: 200
Content-Type: text/html; charset=utf-8
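Since the question mentions JavaScript, the same HEAD check could be sketched roughly as follows. This assumes a runtime with a global fetch and AbortController (modern browsers, Node 18+). The names withScheme and headStatus are hypothetical, and the 60-second default is the timeout from the question. In a browser, cross-origin requests are also subject to CORS, so this is more practical server-side.

```javascript
// Hypothetical helper: the question's regex also matches bare "www." hosts,
// which need a scheme before they can be requested.
function withScheme(candidate) {
  return /^https?:\/\//i.test(candidate) ? candidate : "http://" + candidate;
}

// Hypothetical helper: send a HEAD request and report the status and timing.
// fetchImpl is injectable so the function can be exercised without a network.
async function headStatus(url, timeoutMs = 60000, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = Date.now();
  try {
    const response = await fetchImpl(withScheme(url), {
      method: "HEAD",
      signal: controller.signal,
    });
    return { url, status: response.status, ok: response.ok, ms: Date.now() - started };
  } catch (err) {
    // Network failure, DNS error, or the timeout firing.
    return { url, status: null, ok: false, ms: Date.now() - started, error: String(err) };
  } finally {
    // Always clear the timer so it doesn't keep the process alive.
    clearTimeout(timer);
  }
}

// Usage (performs a real request):
// headStatus("http://bclennox.com").then((result) => console.log(result));
```

Keep in mind that HEAD only gets you the status and headers. A title grab still needs a GET for the response body, so HEAD works best as a cheap first pass before fetching the page.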