URL Regex Matcher (Idea)


Would it be a lot easier to write a simple regex for finding URLs, then have another script check whether each site actually sends back data? I've always wondered whether this would be a faster and easier solution than spending years developing the "perfect" URL-detecting regex, only to have it crushed a few days later.

If anyone can find speed tests for a basic page access/load, please post them here to help answer my question.

Also, how hard would it be on the server to constantly make requests like this, say 100 times an hour?

I am going to test this out with JavaScript, using /(http|www\.)\S+/gim as the regex and a 60-second connection timeout for the requested URL. I will do a simple "Title Grab" from the URL, then record how long each trial takes. I'll post the speeds once I get them all fancied up.
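
For reference, the extraction step described above might look roughly like this. It is only a sketch: the helper name `findCandidateUrls` and the sample text are made up for illustration, and only the regex itself comes from the post.

```javascript
// Loose pattern from the post: anything starting with "http" or "www."
// and running up to the next whitespace character.
const CANDIDATE_URL = /(http|www\.)\S+/gim;

// Hypothetical helper (not from the post): returns every candidate
// substring found in `text`, or an empty array when nothing matches.
function findCandidateUrls(text) {
  // With the g flag, String.prototype.match returns all full matches,
  // or null when there are none.
  return text.match(CANDIDATE_URL) || [];
}

// Made-up sample text for illustration.
const sample = "See http://bclennox.com and www.example.org/page, or nothing.";
console.log(findCandidateUrls(sample));
```

Because `\S+` runs to the next whitespace, the second match keeps its trailing comma (`www.example.org/page,`). That kind of over-matching is exactly what the follow-up request would have to catch, or what a small cleanup step would need to trim first.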

This really isn't much of a question anymore, so if you find anything that really helps me with the idea above, you might receive the gloried Answer Checkmark for this "question".


Solution

  • I think the point @Kobi was making is that validity of a URL is distinct from presence of a resource at that URL. A valid URL may not point to a present resource. For example, the URL http://bclennox.com/there-is-no-page-at-this-address will return a 404, presumably failing your test even though it's a perfectly valid URL.

    At any rate, if you're primarily interested in the HTTP status returned for a given URL, you can just issue an HTTP HEAD request rather than a normal GET. HEAD returns a much smaller payload (only the headers), which should speed up your requests considerably.

    Here's an example using curl:

    $ curl -I http://bclennox.com
    HTTP/1.1 200 OK
    Date: Thu, 15 Mar 2012 03:14:59 GMT
    Server: Apache
    X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.9, Enterprise Edition
    ETag: "39cf7d1099a034de95dda297b18bfa2d"
    X-UA-Compatible: IE=Edge,chrome=1
    X-Rack-Cache: miss
    X-Runtime: 0.139410
    X-Request-Id: 50ce319e403ef4e6e468c2f4b9817691
    Cache-Control: max-age=0, private, must-revalidate
    Set-Cookie: _master_session=BAh7ByIQX2NzcmZfdG9rZW4iMWZhM0t1dTZiNjVWV1Q3YzlKVTZmdjRwK0FiWlpHUExVWXJnRlovd2R5aU09Ig9zZXNzaW9uX2lkIiU3YWEzZmNhYmYzYTQ2MDgwNTY5ZmU5MjhlNWU3ZDhmMA%3D%3D--c0f8c2bd6cccb1ff12f28da996dddbb50e448f1f; path=/; HttpOnly
    Status: 200
    Content-Type: text/html; charset=utf-8
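
Since the question mentions JavaScript, the same HEAD check could be sketched as below. This is a rough illustration under a few assumptions, not a definitive implementation: `checkUrl`, `timeoutMs`, and `fetchImpl` are hypothetical names, the 60-second default mirrors the timeout from the question, and a global `fetch` and `AbortController` are assumed to be available (a modern browser or a recent Node.js).

```javascript
// Hypothetical helper: issues a HEAD request and reports the HTTP status
// and the elapsed time. `fetchImpl` is injectable (an assumption made here
// so the function can be exercised without real network access).
async function checkUrl(url, { timeoutMs = 60000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  // Abort the request if it has not completed within timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = Date.now();
  try {
    const response = await fetchImpl(url, {
      method: "HEAD",
      signal: controller.signal,
    });
    // response.ok is true only for 2xx statuses, so a 404 is "not reachable"
    // here even though the URL itself is valid.
    return { url, reachable: response.ok, status: response.status, ms: Date.now() - started };
  } catch (err) {
    // Network failure, DNS error, or the timeout aborting the request.
    return { url, reachable: false, status: null, ms: Date.now() - started };
  } finally {
    clearTimeout(timer);
  }
}
```

Given the 404 mentioned above, `await checkUrl("http://bclennox.com/there-is-no-page-at-this-address")` would be expected to resolve with `reachable: false` and `status: 404`. Two caveats: a HEAD response has no body, so a "Title Grab" would still need a GET, and in a browser, cross-origin requests are subject to CORS, so a check like this is usually easier to run server-side.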