I am trying to compare two URLs, such as http://www.example.com and http://example.com, with the expectation that the result would show they are similar.
I am using the Uri class to represent the urls and Uri.Compare to compare the specified parts of the two Uris using the specified comparison rules.
Uri uri1 = new Uri("http://www.example.com");
Uri uri2 = new Uri("http://example.com");
var result = Uri.Compare(
    uri1,
    uri2,
    UriComponents.NormalizedHost | UriComponents.PathAndQuery,
    UriFormat.SafeUnescaped,
    StringComparison.OrdinalIgnoreCase
);
Console.WriteLine(result);
I have tested with both UriComponents.Host and UriComponents.NormalizedHost, which match the Host data and the Normalized Host respectively, but they both seem to compare the entire domain name of the URLs (www.example.com is compared to example.com) in this situation. See UriComponents Enum for more information.
Is there a way to compare these (URLs with www and without www) using Uri.Compare? If not, what other solution would be suitable for such a case?
Before providing an answer, I want to acknowledge @Marc Gravell's point from the comments:
www.example.com and example.com are different hosts; one is an "ANAME", the other is a "CNAME" - they are allowed to resolve to different IP addresses and serve completely different content…
Obviously, in practice, we would never expect this between www and the root domain, per common convention. But Marc's point is relevant because that's why you may not find native support for this in Uri.Compare().
Of course, in the absence of sufficient documentation, you'd be forgiven for thinking that UriComponents.NormalizedHost might do this, so I applaud your effort. After all, removing (or adding) the www is within the bounds of URL normalization:
Removing or adding “www” as the first domain label. Some websites operate identically in two Internet domains: one whose least significant label is “www” and another… [being the] naked domain. For example, http://www.example.com/ and http://example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately.
I want to highlight that last line, though:
A normalizer may determine if one of these URIs redirects to the other and normalize all URIs appropriately
Critically, this should not be assumed in URL normalization. Returning to @Marc's point, even a first-order validation of this would require an expensive DNS lookup of the hostnames, and that wouldn't be sufficient in guaranteeing this in the general case, even if it's highly unexpected in this specific case. And, thus returning to @Marc's comment:
…unfortunately, "similar" isn't a thing in equality / inequality operations: there is just equal, non equal (for equality) and "less", "more", "equivalent sorting" for inequality; if you want to define your own rules where www.x.com is the same as x.com, great! but: that's not a reliable thing
So what to do? While @Fadam's answer lacks sufficient detail to answer your question, their point that you may need to fall back to string comparison stands.
Unfortunately, this task is made more difficult since the Uri class doesn't offer a Normalize() method to return a normalized URI. That said, the ToString() method at least returns the canonical version of the URL, and not the original string used to construct the Uri object. As such, depending on what assumptions you're confident making in your data, you may be able to do something as simple as:
uri1.ToString().Replace("www.", "", StringComparison.InvariantCultureIgnoreCase) == uri2.ToString().Replace("www.", "", StringComparison.InvariantCultureIgnoreCase)
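The "assumptions about your data" matter here because Replace() removes every occurrence of "www." in the string, not only a leading host label. As a minimal sketch of that caveat (the NaiveWwwStrip class and its method names are hypothetical, introduced only for illustration, and the three-argument Replace() overload assumes .NET Core 2.0 or later):

```csharp
using System;

public static class NaiveWwwStrip
{
    // Same technique as the one-liner above: drop every "www." from the
    // canonical string form of the URI, wherever it occurs.
    public static string Strip(Uri uri) =>
        uri.ToString().Replace("www.", "", StringComparison.InvariantCultureIgnoreCase);

    // True when the two stripped strings are identical.
    public static bool LooksEqual(Uri left, Uri right) =>
        Strip(left) == Strip(right);
}
```

With this, LooksEqual(new Uri("http://www.example.com/"), new Uri("http://example.com/")) is true, as intended. But LooksEqual(new Uri("http://awww.example.com/"), new Uri("http://aexample.com/")) is also true, even though those are unrelated hosts, and a "www." inside a path or query string would be removed as well. If your data can contain such values, restricting the change to the start of the host is the safer option.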
That said, what I'd recommend is simply to normalize your data going in:
Uri uri1 = new Uri("http://www.example.com".Replace("www.", "", StringComparison.InvariantCultureIgnoreCase));
Uri uri2 = new Uri("http://example.com".Replace("www.", "", StringComparison.InvariantCultureIgnoreCase));
(Acknowledging that, in practice, these would be variables from your user input; otherwise we obviously wouldn't require the Replace().)
And then simply continue to use the Uri.Compare() code you already proposed:
Uri.Compare(
    uri1,
    uri2,
    UriComponents.NormalizedHost | UriComponents.PathAndQuery,
    UriFormat.SafeUnescaped,
    StringComparison.OrdinalIgnoreCase
);
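If you'd rather not touch the raw input string, a variation on the same idea is to strip only a leading "www." label from the parsed host and then hand both URIs to Uri.Compare(). The following is a sketch under assumptions, not a built-in feature: WwwInsensitiveUri, StripLeadingWww, and AreEquivalent are hypothetical names, and both URIs are assumed to be absolute (Uri.Host throws for relative URIs).

```csharp
using System;

public static class WwwInsensitiveUri
{
    // Hypothetical helper: returns a copy of the URI with one leading
    // "www." host label removed, or the same instance if there is none.
    private static Uri StripLeadingWww(Uri uri)
    {
        const string prefix = "www.";
        if (!uri.Host.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
        {
            return uri;
        }

        // UriBuilder copies scheme, port, path, and query from the original.
        var builder = new UriBuilder(uri) { Host = uri.Host.Substring(prefix.Length) };
        return builder.Uri;
    }

    // Compares host plus path and query, treating "www.x" and "x" as the same host.
    public static bool AreEquivalent(Uri left, Uri right) =>
        Uri.Compare(
            StripLeadingWww(left),
            StripLeadingWww(right),
            UriComponents.NormalizedHost | UriComponents.PathAndQuery,
            UriFormat.SafeUnescaped,
            StringComparison.OrdinalIgnoreCase) == 0;
}
```

Here AreEquivalent(new Uri("http://www.example.com"), new Uri("http://example.com")) returns true, while hosts that merely contain "www." somewhere else, such as awww.example.com and aexample.com, remain distinct.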
Warning: If you do this, however, you'll want to be cautious not to use the trimmed URL outside of this comparison. For instance, a domain may only have an SSL certificate on www, so if a user provides https://www.example.com, but you persist the value https://example.com, that may not function.
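One way to respect that warning is to derive a comparison key from each URI while storing the URI exactly as the user supplied it. This is only a sketch: UrlRegistry and its members are hypothetical names, and the key mirrors the components used in the comparison above (host plus path and query), so the scheme and port are ignored when matching.

```csharp
using System;
using System.Collections.Generic;

public sealed class UrlRegistry
{
    // Maps a www-insensitive comparison key to the URI as originally provided.
    private readonly Dictionary<string, Uri> _byKey =
        new Dictionary<string, Uri>(StringComparer.OrdinalIgnoreCase);

    // The key is used for matching only; it is never handed back to callers.
    private static string ComparisonKey(Uri uri)
    {
        string key = uri.GetComponents(
            UriComponents.NormalizedHost | UriComponents.PathAndQuery,
            UriFormat.SafeUnescaped);

        // The key starts with the host, so a leading "www." is the host label.
        return key.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? key.Substring(4)
            : key;
    }

    // Returns true if stored; false if an equivalent URL was already present.
    public bool TryAdd(Uri original)
    {
        string key = ComparisonKey(original);
        if (_byKey.ContainsKey(key))
        {
            return false;
        }

        _byKey[key] = original;
        return true;
    }

    // Looks up the untrimmed URI that matches the candidate, if any.
    public bool TryGetOriginal(Uri candidate, out Uri original) =>
        _byKey.TryGetValue(ComparisonKey(candidate), out original);
}
```

After adding https://www.example.com/login, a lookup for https://example.com/login finds the stored entry, and the value returned still has www.example.com as its host, so whatever you persist or redirect to is the URL the user actually gave you.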
I realize this isn't the answer you were hoping for, and it's not a very satisfying solution, but it's likely the most practical and accessible approach.
Of course, depending on the reliability of your data, you could end up going down quite a rabbit hole trying to address all of the possible caveats here, e.g., if, beyond the hostname, you're also concerned about other variations, such as trailing slashes, character encoding, query string order, fragments, &c. Using the Uri.Compare() approach with UriComponents.NormalizedHost will address some of that, but in the absence of clear documentation, you'll need to evaluate exactly what it is normalizing.
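Since the documentation is thin, a small probe can help you evaluate what this comparison actually treats as equal. The sketch below wraps the same Uri.Compare() call; NormalizationProbe and its Compare method are hypothetical names, and the inputs are assumed to be absolute URL strings.

```csharp
using System;

public static class NormalizationProbe
{
    // Returns 0 when the two URLs are equal under the comparison used above.
    public static int Compare(string left, string right) =>
        Uri.Compare(
            new Uri(left),
            new Uri(right),
            UriComponents.NormalizedHost | UriComponents.PathAndQuery,
            UriFormat.SafeUnescaped,
            StringComparison.OrdinalIgnoreCase);
}
```

A few cases worth trying: Compare("HTTP://EXAMPLE.com:80", "http://example.com/") returns 0, because the scheme and port are not among the compared components and an empty path is canonicalized to "/". Compare("http://example.com/a#top", "http://example.com/a") also returns 0, because the fragment is not compared. In contrast, "http://example.com/a/" versus "http://example.com/a", and "http://example.com/?a=1&b=2" versus "http://example.com/?b=2&a=1", both return a non-zero value, so trailing slashes on a non-empty path and query string order are yours to handle if they matter.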