I'm writing my own little PHP framework. I want everything to be as semantic as possible, and I'm stuck.
I've got a URL parsing class. It parses the whole URL (scheme, subdomain, domain, resource and query). Next, the router class decides what to do with this URL. If there is a resource corresponding to the URL, it "renders" it; if not, it renders a 404; if the resource is forbidden, it renders a 403, etc. Here is the problem:
Let's say that my site is at http://en.mysite.com. Let's say that the pages asd and &*% do not exist. So I've got two URLs:
http://en.mysite.com/asd
http://en.mysite.com/&*%($^&#
Of course neither page exists. But what should the headers look like? I would expect:
http://en.mysite.com/asd // header 404 Page not found
http://en.mysite.com/&*% // header 400 Bad request
However (based on our guru site):
http://stackoverflow.com/<< // header 404
http://stackoverflow.com/&;: // header 404
http://stackoverflow.com/&*%($%5E&# // header 400 (which btw is not styled...)
https://www.google.com/%&*(#$*%&@^ // header 404...
What is the rule? Should every system decide for itself which symbols are OK in a URL? As for me, a URL should contain only [a-z0-9-_.#!]+. I'm using slashes as parameters, so I don't need ? = &. But what is the general rule? Is there a URL regex in the specification?
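A minimal sketch of the whitelist idea above, assuming each slash-separated segment is checked on its own. The function name `pathMatchesWhitelist` is hypothetical and not part of any framework; the character class is the one from the question, with the hyphen moved to the end so it is read literally.

```php
<?php
// Hypothetical helper: checks every slash-separated segment of a path
// against the whitelist [a-z0-9-_.#!]+ proposed in the question.
function pathMatchesWhitelist(string $path): bool
{
    // \z anchors at the very end of the string (no trailing newline allowed).
    $segmentPattern = '/^[a-z0-9_.#!-]+\z/';

    foreach (explode('/', trim($path, '/')) as $segment) {
        // Empty segments (e.g. the root path "/") are accepted in this sketch.
        if ($segment !== '' && preg_match($segmentPattern, $segment) !== 1) {
            return false;
        }
    }
    return true;
}

var_dump(pathMatchesWhitelist('/asd'));      // bool(true)
var_dump(pathMatchesWhitelist('/&*%($^&'));  // bool(false)
```

Keep in mind that browsers do not send the fragment (everything from # onwards) to the server, so a # would normally never show up in the path the router sees.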
BTW: For those who will say "put a 404 and go drink a beer": I probably will :).
But this problem is fairly serious when it comes to SEO, as a 400 is not quite the same as a 404 for ranking purposes. It is also nice to style the 400 page your own way, and to tell the visitor not "page not found" but "are you trying to inject something into my beautiful URL? It is a BAD REQUEST!"
As far as I can tell from IETF RFC 2616, 400 should be returned for requests that are malformed (i.e. do not conform to IETF RFC 3986), whereas 404 should be returned for resources that do not exist (410 should be returned for resources that once existed but are now gone).
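The distinction above could be sketched roughly like this. The two lookup arrays and the function `statusForPath` are hypothetical stand-ins for whatever the router uses to find resources, and the well-formedness check is assumed to have been done elsewhere.

```php
<?php
// Hypothetical lookup tables standing in for a real resource store.
const EXISTING_PAGES = ['/', '/about'];
const REMOVED_PAGES  = ['/old-news'];

// Returns the status code suggested by the rules above.
function statusForPath(string $path, bool $isWellFormed): int
{
    if (!$isWellFormed) {
        return 400; // the request itself is malformed
    }
    if (in_array($path, EXISTING_PAGES, true)) {
        return 200; // resource exists
    }
    if (in_array($path, REMOVED_PAGES, true)) {
        return 410; // existed once, now gone
    }
    return 404;     // well-formed, but nothing there
}

echo statusForPath('/asd', true), PHP_EOL;      // 404
echo statusForPath('/old-news', true), PHP_EOL; // 410
echo statusForPath('/asd', false), PHP_EOL;     // 400
```

In a real router the returned value would typically be passed to `http_response_code()` before the matching error page is rendered.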
In the examples above, URLs with a % sign not followed by two hexadecimal digits are definitely malformed (e.g. en.mysite.com/&*%($^&# and www.google.com/%&*(#$*%&@^
). Note that a second ? (question mark) in the query part is not malformed by itself, since RFC 3986 allows ? inside the query component.
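The percent-sign rule can be sketched as a single check. This is only an illustration of that one rule, not a full RFC 3986 validator, and the function name `hasValidPercentEncoding` is made up for the example.

```php
<?php
// Hypothetical helper: true when every "%" in the raw request target
// is followed by two hexadecimal digits.
function hasValidPercentEncoding(string $rawTarget): bool
{
    // The pattern matches a "%" that is NOT followed by two hex digits,
    // so any match means the target contains a malformed escape.
    return preg_match('/%(?![0-9A-Fa-f]{2})/', $rawTarget) !== 1;
}

var_dump(hasValidPercentEncoding('/asd'));        // bool(true)
var_dump(hasValidPercentEncoding('/caf%C3%A9'));  // bool(true)
var_dump(hasValidPercentEncoding('/&*%($^&'));    // bool(false)
```

The check has to run on the raw, still-encoded request target (for instance `$_SERVER['REQUEST_URI']`), because a decoded path no longer shows which escapes were broken. A target that fails would be a candidate for 400, and one that passes but matches no resource for 404.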
A regular expression for URLs can be found in response to the question: PHP validation/regex for URL.