I'm developing a Windows desktop application in C#/.NET and want to add a feature that opens Windows Explorer and runs a search across the computer from within the application.
I plan to use the Windows search protocol to implement it. Below is my code snippet; rawQuery is passed from my application to the Windows Explorer search box.
var query = "&query=" + HttpUtility.UrlEncode(rawQuery);
var location = string.Empty;
foreach (var drive in DriveInfo.GetDrives().Where(d => d.IsReady && d.DriveType.Equals(DriveType.Fixed)))
{
location += "&crumb=location:" + HttpUtility.UrlEncode(drive.Name);
}
var searchQuery = "search:displayname=Search computer" + query + location;
Process.Start(searchQuery);
The code above has an issue. If rawQuery contains non-English characters, they are shown incorrectly in the Windows Explorer search box after being encoded with HttpUtility.UrlEncode(). For example, if rawQuery is Chinese, like "微软", Explorer searches for "å¾®è½¯" instead.
However, if rawQuery is not encoded, special characters such as & and % cannot be shown in the Windows Explorer search box.
So I'm not sure how to determine whether a character should be encoded. I did not find any documentation about this in the search protocol spec.
Does anybody know which characters should be encoded?
It does seem that there is no documentation on exactly what should be URL-encoded in a search query, but we can make an educated guess.
First, how does HttpUtility.UrlEncode encode Unicode characters? According to RFC 3986, such characters should first be represented as UTF-8 bytes, and then those bytes should be percent-encoded. That is exactly what HttpUtility.UrlEncode does. For your string:
var encoded = HttpUtility.UrlEncode(rawQuery); // = %e5%be%ae%e8%bd%af
The 2 characters are represented by 6 bytes, 3 bytes each. Explorer decodes them as å¾®è½¯: 6 characters, one per byte. So it is clear that the search query decoder does NOT expect UTF-8. Which encoding does it expect? A little experimentation shows that it is ISO-8859-1. You can verify your particular case with this code:
var rawQuery = "微软";
var encoded = HttpUtility.UrlEncode(rawQuery);
var iso = Encoding.GetEncoding("iso-8859-1");
var decoded = HttpUtility.UrlDecode(encoded, iso); // decoded == "å¾®è½¯"
So we can conclude that percent-encoding anything outside ISO-8859-1 makes no sense and will produce invalid results, because those characters simply cannot be represented in this encoding (it is only an 8-bit encoding).
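To tell whether a given query falls outside that range, it is enough to look at each character's code point. This is only a minimal sketch: the class and method names are made up for illustration, and it relies on the experimental finding above that Explorer decodes as ISO-8859-1, which is not documented behavior.

```csharp
using System.Linq;

public static class Latin1Check
{
    // ISO-8859-1 maps bytes 0x00-0xFF directly to U+0000-U+00FF,
    // so a string fits exactly when every UTF-16 code unit is <= 0xFF.
    // FitsInLatin1 is a hypothetical helper name, not a framework API.
    public static bool FitsInLatin1(string s) => s.All(c => c <= 0xFF);
}
```

For a query like "微软" this returns false, which suggests passing those characters through raw rather than percent-encoding them.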
What should be encoded inside that set? Anything above ASCII (so characters 128-255) can be passed without encoding. This is against the RFC of course, but we already know that the search protocol does not follow it anyway, because it accepts non-ASCII characters unencoded. You can encode characters like ¢ (162 in ISO-8859-1) as %A2 if you want to be completely on the safe side, and it will work, but it will also work without.
Now we need to encode the ASCII characters that are reserved for special use in parts of a URL, are not allowed there unescaped at all, or are considered likely to cause problems when used unescaped. RFC 2396 (the predecessor of RFC 3986) lists these characters as:
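If you do want to escape a character from the upper half of ISO-8859-1, it has to be encoded as a single byte, not as UTF-8. One way to sketch this is the HttpUtility.UrlEncode overload that takes an Encoding. The class name here is hypothetical, and HttpUtility emits lowercase hex (%a2 rather than %A2).

```csharp
using System.Text;
using System.Web;

public static class Latin1Escape
{
    // Percent-encodes using ISO-8859-1 bytes instead of UTF-8 bytes,
    // so each character in the 128-255 range becomes exactly one %XX escape.
    // Only meaningful for input that is representable in ISO-8859-1.
    public static string Encode(string s)
    {
        var latin1 = Encoding.GetEncoding("iso-8859-1");
        return HttpUtility.UrlEncode(s, latin1);
    }
}
```

With this, "¢" becomes a single escape for byte 0xA2, whereas the default UTF-8 overload would produce two escapes (0xC2 0xA2) that an ISO-8859-1 decoder turns into two characters.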
control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
space = <US-ASCII coded character 20 hexadecimal>
delims = "<" | ">" | "#" | "%" | <">
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
Now, not all of those characters need to be encoded in this particular case, and most of them will work unencoded. But again, if you want to be on the safe side, you can just encode them all, or figure out the exact set by trial and error (characters like "&", "%", and "/" obviously must be encoded anyway).
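Putting this together, the "encode them all" option could look roughly like the sketch below: percent-encode only the ASCII characters from the lists above and pass everything else, including Chinese text, through untouched. The class and method names are hypothetical, and the character set is the conservative guess described here, not something taken from the search protocol documentation.

```csharp
using System.Text;

public static class SearchQueryEscaper
{
    // ASCII characters from the delims, unwise, and reserved lists above.
    private const string Special = "<>#%\"{}|\\^[]`;/?:@&=+$,";

    // Escapes control characters, space, and the special ASCII characters
    // as %XX; all other characters are appended unchanged.
    public static string Escape(string rawQuery)
    {
        var sb = new StringBuilder(rawQuery.Length);
        foreach (char c in rawQuery)
        {
            bool isControlOrSpace = c <= 0x20 || c == 0x7F;
            if (isControlOrSpace || Special.IndexOf(c) >= 0)
                sb.Append('%').Append(((int)c).ToString("X2"));
            else
                sb.Append(c); // letters, digits, and all non-ASCII stay raw
        }
        return sb.ToString();
    }
}
```

In the question's code, this would be used as `var query = "&query=" + SearchQueryEscaper.Escape(rawQuery);` in place of HttpUtility.UrlEncode. Whether Explorer accepts every one of these escapes is something to confirm by experiment.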