Which version of GetAttributeValue of the 'HTML Agility Pack' is used when calling from PowerShell with the second parameter $null?


I am writing a PowerShell script to run on Windows 10, using version 1.11.43 of the 'HTML Agility Pack' library.

In this library, there is a GetAttributeValue method for HTML element nodes in four versions:

  1. public string GetAttributeValue(string name, string def)
  2. public int GetAttributeValue(string name, int def)
  3. public bool GetAttributeValue(string name, bool def)
  4. public T GetAttributeValue<T>(string name, T def)

I have written a PowerShell test script for these methods:

$libPath = "HtmlAgilityPack.1.11.43\lib\netstandard2.0\HtmlAgilityPack.dll"
Add-Type -Path $libPath
$dom = New-Object -TypeName "HtmlAgilityPack.HtmlDocument"
$dom.Load("test.html", [System.Text.Encoding]::UTF8)

foreach ($node in $dom.DocumentNode.DescendantNodes()) {
    if ("#text" -ne $node.Name) {
        $node.OuterHTML
        "    " + $node.GetAttributeValue("class", "")
        "    " + $node.GetAttributeValue("class", 0)
        "    " + $node.GetAttributeValue("class", $true)
        "    " + $node.GetAttributeValue("class", $false)
        "    " + $node.GetAttributeValue("class", $null)
    }
}

File 'test.html':

<p class="true"></p>
<p class="false"></p>
<p></p>
<p class="any other text"></p>

Test script execution result:

<p class="true"></p>
    true
    0
    True
    True
    True
<p class="false"></p>
    false
    0
    False
    False
    False
<p></p>

    0
    True
    False
    False
<p class="any other text"></p>
    any other text
    0
    True
    False
    False

I know that the attribute value of an HTML element can also be obtained with the expression $node.Attributes["class"]. I also understand polymorphism, method overloading, and generic methods, so none of that needs explaining.

I have three questions:

  1. When calling $node.GetAttributeValue("class", $null) from a PowerShell script, which of the four variants of the GetAttributeValue method is invoked?

  2. I think the fourth option works (generic method). Then why does a call with the second parameter $null work exactly the same as a call with the second parameter $false?

  3. In the C# source code, the fourth option requires the following condition to work

#if !(METRO || NETSTANDARD1_3 || NETSTANDARD1_6)

I tried the library builds for NETSTANDARD1_6 and for NETSTANDARD2_0, and the test script behaves the same way with both. But with NETSTANDARD1_6 the fourth option should be unavailable, right? In that case, which version of GetAttributeValue is invoked when the second parameter is $null?


Solution

  • tl;dr

    To achieve what you unsuccessfully attempted with
    $node.GetAttributeValue("class", $null), i.e., to return the attribute value as a [string] and default to $null if there is none, use:

    $node.GetAttributeValue("class", [string] [NullString]::Value)
    

    [string] $null works too, but makes "" (the empty string) rather than $null the default value.


    While the overload resolution that you're seeing is surprising, you can resolve ambiguity during PowerShell's method overload resolution with casts:

    $dom = [HtmlAgilityPack.HtmlDocument]::new()
    $dom.LoadHtml(@'
    <p class="true"></p>
    <p class=42></p>
    <p></p>
    <p class="any other text"></p>
    '@)
    
    $nodes = $dom.DocumentNode.SelectNodes('p')
    
    # Note the use of explicit casts (e.g., [string]) to guide overload resolution.
    $nodes[0].GetAttributeValue('class', [bool] $false)
    $nodes[1].GetAttributeValue('class', [int] 0)
    $nodes[2].GetAttributeValue('class', [string] 'default')
    $nodes[3].GetAttributeValue('class', [string] [NullString]::Value)
    

    Output:

    True
    42
    default
    any other text
    

    Alternatively, in PowerShell (Core) 7.3+[1], you can now call generic methods with explicit type arguments:

    # PS 7.3+
    # Note the generic type argument directly after the method name.
    # Calls the one and only generic overload, with various types substituted for T:
    #   public T GetAttributeValue<T>(string name, T def)
    # Note how the 2nd argument doesn't need a cast anymore.
    $nodes[0].GetAttributeValue[bool]('class', $false)
    $nodes[1].GetAttributeValue[int]('class', 0)
    $nodes[2].GetAttributeValue[string]('class', 'default')
    $nodes[3].GetAttributeValue[string]('class', [NullString]::Value)
    

    Note:

    • When you pass $null to a [string]-typed parameter (both in cmdlets and .NET methods), PowerShell quietly converts it to "" (the empty string). [NullString]::Value tells PowerShell to pass a true null instead, and is mostly needed for calling .NET methods whose behavior differs depending on whether null or "" is passed.

    • Therefore, if you were to call $nodes[3].GetAttributeValue('class', [string] $null) or, in PS 7.3+, $nodes[3].GetAttributeValue[string]('class', $null), you'd get "" (the empty string) as the default value if the class attribute doesn't exist.

    • By contrast, [NullString]::Value, as used in the commands above, causes a true $null value to be returned if the attribute doesn't exist; you can test for that with $null -eq ....


    As for your questions:

    On a general note, PowerShell's overload resolution is complex, and for the ultimate source of truth you'll have to consult the source code. The following is based on the de-facto behavior as of PowerShell 7.2.6 and musings about logic that could be applied.

    When calling $node.GetAttributeValue("class", $null) from a PowerShell script, which of the four variants of the GetAttributeValue method works?

    In practice, the public bool GetAttributeValue(string name, bool def) overload is chosen. Why that one specifically is chosen among the available overloads is ultimately immaterial: the fundamental problem is that, to PowerShell, $null provides insufficient information about the type it may be a stand-in for, so PowerShell cannot generally be expected to select a specific overload (for that, you need a cast, as shown at the top):

    • In C#, passing null to the second parameter in a non-generic call unambiguously implies the overload with the string-typed def parameter, because among the non-generic overloads, string as the type of the def parameter is the only .NET reference type, and therefore the only type that can directly accept a null argument.

    • This is not true in PowerShell, which has much more flexible, implicit type-conversion rules: from PowerShell's perspective, $null can bind to any of the types among the def parameters, because it allows $null to be converted to those types; specifically, [bool] $null yields $false, [int] $null yields 0, and - perhaps surprisingly, as discussed above - [string] $null yields "" (the empty string).

      • Thus, PowerShell is justified in selecting any one of the non-generic overloads in this case, and which one it chooses should be considered an implementation detail.
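
    The conversions named above can be checked directly, independently of the HTML Agility Pack. The following is a minimal sketch; the variable names are made up for illustration:

    ```powershell
    # Each cast shows what $null becomes for the corresponding parameter type.
    $asBool   = [bool] $null      # $false
    $asInt    = [int] $null       # 0
    $asString = [string] $null    # '' (the empty string), not $null

    # Prints: False / 0 / []
    '{0} / {1} / [{2}]' -f $asBool, $asInt, $asString

    # Prints: False, because the cast produced an empty string rather than $null.
    $null -eq $asString
    ```

    Because all three casts succeed, $null by itself gives PowerShell no basis for preferring one def parameter type over another.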

    However, curiously, even using [NullString]::Value doesn't make a difference, even though PowerShell should know that this special value represents a $null value for a string parameter - see GitHub issue #18072.


    I think the fourth option works (generic method). Then why does a call with the second parameter $null work exactly the same as a call with the second parameter $false?

    With the generic invocation syntax available in v7.3+, the generic overload definitely works - and a $null as the default-value argument is converted to the type specified as the type argument (assuming PowerShell allows such a conversion; it wouldn't work with [datetime], for instance, because [datetime] $null causes an error).
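
    The [datetime] case mentioned above can be reproduced without the library. This is a small sketch, with $failed and $message as illustrative variable names:

    ```powershell
    # PowerShell defines no conversion from $null to [datetime],
    # so the cast raises a terminating error that try/catch can handle.
    $failed = $false
    $message = $null
    try {
        $result = [datetime] $null
    }
    catch {
        $failed = $true
        $message = $_.Exception.Message
    }

    $failed     # True
    $message    # the conversion error text
    ```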

    Even with the non-generic syntax, PowerShell does select the generic overload by inference, as the following example shows, but only when you pass an actual object rather than $null:

    # Try to retrieve a non-existent attribute and provide a [double]
    # default value.
    # The fact that a [double] instance is returned implies that the
    # generic overload was chosen.
    #  -> 'System.Double'
    $nodes[0].GetAttributeValue('nosuch', [double] $null).GetType().FullName
    

    In the C# source code, the fourth option requires the following condition to work [...]

    When you pass $null, the generic overload is not considered - and cannot be, in the absence of type information - so this doesn't make a difference.


    [1] As of this writing, v7.3 hasn't been released yet, but preview versions are available - see the repo.