I am developing a C# app that fetches web pages and processes their contents line by line. To do this, I use the HttpClient class and read the page contents through ReadAsStreamAsync(). Then I read the stream into a line array and iterate over it. So far so good.
However, the HTML that I obtain with this method is not identical to the HTML that I see when I navigate to the page in Chrome or Edge and use View Source. In particular, the __VIEWSTATE and __VIEWSTATEGENERATOR hidden input elements are surrounded by div elements with class="aspNetHidden" when I use the browser, but not when I get the HTML programmatically. This ruins my line-tracking logic, because the page as seen by the browser has extra lines compared to the page I get in code.
EDIT. After some testing, I am confident that the User-Agent header sent by the client is what determines whether or not the class="aspNetHidden" div is served. When I mimic my browser's user agent ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"), the div is served; if I use some other agent such as "Test Client", the div is not served.
My question then is: is there any documentation on which user agent strings cause the div to be served and which don't? Also, can I prevent this from happening?
Thanks.
In short, this is not documented or specified in terms of user agents, but in terms of browser capabilities.
Based on the browser's user agent, a set of capabilities is set up.
These capabilities are configured in .browser configuration files on the web server.
For .NET 4, for example, you find these files in %SystemRoot%\Microsoft.NET\Framework\v4.0.30319\config\browsers, e.g. chrome.browser, iphone.browser, etc.
Such a .browser file contains a tagwriter capability.
E.g. chrome.browser:
<browsers>
  <!-- Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/530.1 (KHTML, like Gecko) Chrome/2.0.168.0 Safari/530.1 -->
  <browser id="Chrome" parentID="WebKit">
    <identification>
      <userAgent match="Chrome/(?'version'(?'major'\d+)(\.(?'minor'\d+)?)\w*)" />
    </identification>
    <capabilities>
      <capability name="browser" value="Chrome" />
      <capability name="tagwriter" value="System.Web.UI.HtmlTextWriter" />
      <!-- ... -->
    </capabilities>
  </browser>
</browsers>
The tagwriter capability specifies whether a System.Web.UI.HtmlTextWriter or a System.Web.UI.Html32TextWriter will be instantiated to write the output.
The default configuration in the Default.browser file declares tagwriter as:
<capability name="tagwriter" value="System.Web.UI.Html32TextWriter" />
Also, if the tagwriter capability is missing, an Html32TextWriter is used.
From the Microsoft reference source:
internal HtmlTextWriter CreateHtmlTextWriterInternal(TextWriter tw) {
    Type tagWriter = TagWriter;
    if (tagWriter != null) {
        return Page.CreateHtmlTextWriterFromType(tw, tagWriter);
    }
    // Fall back to Html 3.2
    return new Html32TextWriter(tw);
}
The Html32TextWriter declares that no div is rendered around hidden input fields.
From the Microsoft reference source:
internal override bool RenderDivAroundHiddenInputs {
    get {
        return false;
    }
}
The HtmlTextWriter, by contrast, returns true for RenderDivAroundHiddenInputs; see the Microsoft reference source.
Some more reading about all this here.
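To make the chain from user agent to tagwriter to wrapping div concrete, here is a minimal sketch of the decision. It is a hypothetical model for illustration, not ASP.NET's implementation: the class and method names (TagWriterModel, ResolveTagWriter, RendersDivAroundHiddenInputs) are made up, it only applies the single Chrome pattern quoted above, and it ignores the parentID chain and all other .browser files that a real server evaluates.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical model for illustration only; this is NOT ASP.NET's implementation.
// It covers one rule (the Chrome pattern from chrome.browser) and nothing else.
public static class TagWriterModel
{
    public const string HtmlWriter = "System.Web.UI.HtmlTextWriter";
    public const string Html32Writer = "System.Web.UI.Html32TextWriter";

    // Same regular expression as the <userAgent match="..."> attribute in chrome.browser.
    private static readonly Regex ChromePattern =
        new Regex(@"Chrome/(?'version'(?'major'\d+)(\.(?'minor'\d+)?)\w*)");

    // Step 1: user agent -> tagwriter capability.
    // Anything that does not match falls back to the Default.browser value.
    public static string ResolveTagWriter(string userAgent)
    {
        return ChromePattern.IsMatch(userAgent ?? string.Empty)
            ? HtmlWriter
            : Html32Writer;
    }

    // Step 2: tagwriter -> whether hidden inputs get a wrapping div.
    // Mirrors RenderDivAroundHiddenInputs: true for HtmlTextWriter, false for Html32TextWriter.
    public static bool RendersDivAroundHiddenInputs(string userAgent)
    {
        return ResolveTagWriter(userAgent) == HtmlWriter;
    }
}
```

With this model, the Edge user agent from the question (which contains "Chrome/83.0.4103.61") yields true, while "Test Client" yields false, which matches the behaviour observed in the question.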
What you can do.
If you always want the wrapping div, use one of the well-known user agents; otherwise use a custom one like the "Test Client" you are already using.
If you control the website being requested, you can set up a custom .browser file for your custom user agent, but I would rather not go that way.
When making the request, just set the appropriate User-Agent request header on your HttpClient, e.g.:
var client = new HttpClient();
var userAgent = "Test Client"; // Or "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.37"
client.DefaultRequestHeaders.Add("User-Agent", userAgent);
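Combined with the line-by-line reading described in the question, this could look roughly like the following sketch. The PageFetcher class, its method names, and the URL in the usage comment are assumptions for illustration; the point is only that pinning one User-Agent keeps the served markup, and therefore the line numbers, stable between runs.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper; the names are illustrative and not from the original post.
public static class PageFetcher
{
    // Creates a client that sends the same User-Agent on every request,
    // so the server picks the same tagwriter each time.
    public static HttpClient CreateClient(string userAgent)
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.Add("User-Agent", userAgent);
        return client;
    }

    // Downloads the page and returns its contents as a list of lines.
    public static async Task<List<string>> ReadLinesAsync(HttpClient client, string url)
    {
        var lines = new List<string>();
        using (Stream stream = await client.GetStreamAsync(url))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = await reader.ReadLineAsync()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}

// Usage (placeholder URL):
//   var client = PageFetcher.CreateClient("Test Client");
//   var lines = await PageFetcher.ReadLinesAsync(client, "https://example.com/page.aspx");
```

If a particular user agent string is rejected by the header validation in Add, TryAddWithoutValidation on the same DefaultRequestHeaders collection is an alternative worth trying.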