Unable to allow these characters in URL:- % / \ # +


About the system

I have URLs of this format in my project:-

http://project_name/browse_by_exam/type/tutor_search/keyword/class/new_search/1/search_exam/0/search_subject/0

Where keyword/class pair means search with "class" keyword.

Following is my htaccess file:-

##AddHandler application/x-httpd-php5 .php

Options Includes +ExecCGI
Options +FollowSymLinks

<IfModule mod_rewrite.c>
RewriteEngine on

############To remove index.php from URL

RewriteCond $1 !^(index\.php|resources|robots\.txt)
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php/$1 [L,QSA]
#################################################end of find a class 


</IfModule>

I have a common index.php file which executes for every module in the project. There is only a rewrite rule to remove the index.php from URL (as you can see above).

I am not using any htaccess rewrite rules to define the $_GET array. A URL parser function in PHP does that instead. For the example URL I gave, the parser returns:-

Array ( [a] => browse_by_exam [type] => tutor_search [keyword] => class [new_search] => 1 [search_exam] => 0 [search_subject] => 0 )
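
For illustration, a parser producing that array could look roughly like the sketch below. This is not the project's actual parser: the function name parse_path_pairs and the rule that the first segment becomes the a entry are assumptions inferred from the output above.

```php
<?php
// Hypothetical sketch, not the project's real parser.
// Turns "/action/k1/v1/k2/v2" into ['a' => 'action', 'k1' => 'v1', 'k2' => 'v2'].
function parse_path_pairs(string $path): array
{
    // Drop empty segments from leading/trailing slashes.
    // 'strlen' is used as the filter so that a "0" segment is kept.
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));
    if ($segments === []) {
        return [];
    }
    // Assumption: the first segment is the action, the rest are key/value pairs.
    $params = ['a' => array_shift($segments)];
    for ($i = 0; $i + 1 < count($segments); $i += 2) {
        $params[$segments[$i]] = $segments[$i + 1];
    }
    return $params;
}

print_r(parse_path_pairs(
    '/browse_by_exam/type/tutor_search/keyword/class/new_search/1/search_exam/0/search_subject/0'
));
```

A trailing key without a value is silently ignored in this sketch; a real parser might want to treat that as an error.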

I use urlencode() when building the search URL and urldecode() when reading it.
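
As a quick check of what the two functions do with the characters named in the title, the sketch below encodes a search term and decodes it once, then twice. The term is made up for illustration. The second decode is lossy because urldecode() turns a literal + into a space. That is one possible, unconfirmed explanation for characters going missing if the value has already been decoded before it reaches the parser.

```php
<?php
// Made-up search term containing each of the problem characters.
$term = 'a%b/c\\d#e+f';

// urlencode() percent-encodes every one of them.
$encoded = urlencode($term);
echo $encoded, "\n"; // a%25b%2Fc%5Cd%23e%2Bf

// Decoding exactly once restores the original term.
$once = urldecode($encoded);
var_dump($once === $term); // bool(true)

// Decoding again changes the value: the literal "+" becomes a space.
$twice = urldecode($once);
echo $twice, "\n"; // a%b/c\d#e f
```

A # that reaches the browser unencoded starts the fragment part of the URL, and the fragment is never sent to the server, so it cannot be recovered on the PHP side.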

Problem

I am facing problems with some characters in the URL

Character               Response
%                       400 - Bad Request - Your browser sent a request that this server could not understand.
/                       404 - Not Found
\ # +                   Page does not break but urldecode() removes these characters.

I want to allow all of these characters. What could be the problem, and how do I allow them? Please help.

Thanks, Sandeepan

Updates

Now only the / character breaks the URL (404 error, as before). So I removed the htaccess rewrite rule which hides index.php in the URL and tried the complete URL instead. For the search term class/new I tried the following two URLs:-

http://project_name/index.php?browse_by_exam/type/tutor_search/keyword/class%2Fnew/new_search/1/search_exam/0/search_subject/0

http://project_name/index.php/browse_by_exam/type/tutor_search/keyword/class%2Fnew/new_search/1/search_exam/0/search_subject/0

The first one works but the second one does not. Notice the index.php?browse_by_exam in the first one.

But I can't use the first URL convention. I have to make / work with index.php hidden. Please help.

Thanks again Sandeepan

Edit (Solved)

Considering Bobince's answer to my other question, "urlencoded Forward slash is breaking URL", I feel it is best to have URLs like this:-

http://project_name/browse_by_exam?type/tutor_search/keyword/class%2Fnew/new_search/1/search_exam/0/search_subject/0

That way I avoid the poor readability of the &param1=value1&param2=value2 convention, and I can also allow forward slashes, because everything after the ? is part of the query string.

I want to avoid AllowEncodedSlashes because Bobince said: "Also some tools or spiders might get confused by it. Although %2F to mean / in a path part is correct as per the standard, most of the web avoids it."


Solution

  • Some of the issues sound like they are related to you trying to use PATH_INFO (your RewriteRule sticks everything behind index.php as if it were a path). Would it be possible to just use the $_SERVER['REQUEST_URI'] variable as the input to your URL parser function instead? It contains the same information, and I feel it would be less problematic.

    Attempting to create a PATH_INFO solution doesn't seem to work very well in a per-dir (.htaccess) context. You can set AcceptPathInfo On, but once mod_rewrite attempts to redirect the URL internally, it seems like Apache doesn't want to parse out the trailing part of the URL, which results in the 404 error.

    If you use $_SERVER['REQUEST_URI'] instead, then you can just rewrite to index.php without the trailing information, like so:

    RewriteCond $1 !^(index\.php|resources|robots\.txt)
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^(.*)$ index.php [L,QSA]
    

    As far as the 400 error goes, urlencode() should encode your % as %25. A literal % that is not followed by two hex digits is an invalid escape sequence, which is a likely cause of the Bad Request response. I'd check that your search URLs are actually encoded in the output sent to the browser, as this may be related to the issues with the other remaining characters as well (but I'm not sure).

    Edit: If you used the rewrite above, you'd have URLs like

    http://project_name/browse_by_exam/type/tutor_search/keyword/class/new_search/1/search_exam/0/search_subject/0
    

    and they would be internally redirected to index.php. Then, you could get the part

    /browse_by_exam/type/tutor_search/keyword/class/new_search/1/search_exam/0/search_subject/0
    

    from $_SERVER['REQUEST_URI'] in that script (it would contain this value) which you could then parse like you're doing now. I'm not sure why you have to be able to have it rewritten after the index.php, since you can get this information even if it isn't, and it looks the exact same to the user in their browser. You could even do this at the beginning of the script, if the part that uses $_SERVER['PATH_INFO'] is not available for changing:

    $_SERVER['PATH_INFO'] = $_SERVER['REQUEST_URI'];
    

    If you really can't do it like this, I'm not sure that there is a solution (there was an explanation in your other question on why this is problematic), but I'll look to see if it's at all possible and get back to you.
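
A minimal sketch of the REQUEST_URI approach described above, under some assumptions: the helper name request_segments and the sample URI are made up for illustration, and in a real request the input would be $_SERVER['REQUEST_URI']. The sketch splits on literal slashes before decoding, so an encoded %2F stays inside its segment. Whether Apache lets such a URL reach the script at all still depends on the server configuration (the AllowEncodedSlashes setting mentioned in the question), so this only covers the PHP side.

```php
<?php
// Hypothetical helper: split the raw request path, then decode each piece.
function request_segments(string $requestUri): array
{
    // REQUEST_URI includes the query string, so drop it first.
    $path = explode('?', $requestUri, 2)[0];

    // Split on literal slashes while %2F is still encoded.
    // 'strlen' is used as the filter so that a "0" segment is kept.
    $raw = array_values(array_filter(explode('/', $path), 'strlen'));

    // Decode each segment exactly once; urldecode() pairs with the
    // urlencode() call used when the URL was built.
    return array_map('urldecode', $raw);
}

// Sample value standing in for $_SERVER['REQUEST_URI'].
$uri = '/browse_by_exam/keyword/class%2Fnew/new_search/1?debug=1';
print_r(request_segments($uri));
```

The third element of the result is class/new, with the slash intact, because the split happened before the decode.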