Search code examples
.netpowershellindexinghashtable

Does there exist a designated (sub)index delimiter?


Background

It is quite common in PowerShell to build a hash table to quickly access objects by a specific property, e.g. to base an index on the LastName:

$List =  ConvertFrom-Csv @'
Id, LastName, FirstName, Country
 1, Aerts,    Ronald,    Belgium
 2, Berg,     Ashly,     Germany
 3, Cook,     James,     England
 4, Duval,    Frank,     France
 5, Lyberg,   Ash,       England
 6, Fischer,  Adam,      Germany
'@

$Index = @{}
$List |ForEach-Object { $Index[$_.LastName] = $_ }

$Index.Cook

Id LastName FirstName Country
-- -------- --------- -------
3  Cook     James     England

In some cases it is required to build the index on two (or even more) properties, e.g. the FirstName and the LastName. For this you might create a multi dimensional key, e.g.:

$Index = @{}
$List |ForEach-Object {
     $Index[$_.FirstName] = @{}
     $Index[$_.FirstName][$_.LastName] = $_
}

$Index.James.Cook

Id LastName FirstName Country
-- -------- --------- -------
3  Cook     James     England

But it is easier (and possibly even faster) to just concatenate the two properties. If only for checking for the existence of the entry: $Index.ContainsKey('James').ContainsKey('Cook') where an error might occur if the FirstName doesn't exist.
To join the properties, it is required to use a delimiter between the property otherwise different property lists might end up as the same key. As this example: AshlyBerg and AshLyberg.

$Index = @{}
$List |ForEach-Object { $Index["$($_.FirstName)`t$($_.LastName)"] = $_ }

$Index."James`tCook"

Id LastName FirstName Country
-- -------- --------- -------
3  Cook     James     England

Note: the above are Minimal, Reproducible Examples. In real life, I come several times to the questions below, which includes generally joining objects where the background - and number of properties used in the index are variable.

Questions:

  1. Is it a good practice to join (concatenate) properties for such a situation?
  2. If yes, is there a (standard?) delimiter for this? (meaning a character -or a sequence of characters- that should never be used/exist in a property name)

Solution

  • There is no built-in separator for multi-component hashtable (dictionary) keys.

    As for a custom separator: Your best bet for a character that is very unlikely to occur in the components themselves is NUL (the character with code point 0x0), which you can represent as "`0" in PowerShell. However, performing convention-based string operations on every lookup is awkward (e.g. $Index."James`0Cook") and generally only works if stringifying the key components is feasible - or if they're all strings to begin with, as in your example.

    Using arrays for multi-component keys is syntactically preferable, but using collections generally does not work as-is, because .NET reference types in general do not compare distinct instances meaningfully, even if they happen to represent the same data - see this answer.

    • Note: The following assumes that the elements of collections serving as keys do compare meaningfully (are themselves strings or .NET value types or .NET reference types with custom equality logic). If that assumption doesn't hold, there's no robust general solution, but a best-effort approach based on CLIXML serialization shown in the linked answer may work, which you yourself have proposed.

    zett42's helpful answer uses tuples, which do perform meaningful comparisons of distinct instances whose members contain equal data. However, the need to construct a tuple instance for each addition / modification / lookup is syntactically awkward (e.g.,
    $Index.([Tuple]::Create('James', 'Cook'))

    There is a way of making regular PowerShell arrays work as hastable keys, in a manner that only adds complexity to creating the hashtable (calling a constructor), while allowing regular array syntax for additions / updates and lookups (e.g., $Index.('James', 'Cook')).

    • Note: The following works equally with [ordered] hashtables, which, however, must be referred to by their true type name so as to be able to call a construct, namely [System.Collections.Specialized.OrderedDictionary].
      However, it does not work with generic dictionaries ([System.Collections.Generic.Dictionary[TKey, TValue]]).
    # Sample objects for the hashtable.
    $list =  ConvertFrom-Csv @'
    Id, LastName, FirstName, Country
     1, Aerts,    Ronald,    Belgium
     2, Berg,     Ashly,     Germany
     3, Cook,     James,     England
     4, Duval,    Frank,     France
     5, Lyberg,   Ash,       England
     6, Fischer,  Adam,      Germany
    '@
    
    # Initialize the hashtable with a structural equality comparer, i.e.
    # a comparer that compares the *elements* of the array and only returns $true
    # if *all* compare equal.
    # This relies on the fact that [System.Array] implements the
    # [System.Collections.IStructuralEquatable] interface.
    $dict = [hashtable]::new([Collections.StructuralComparisons]::StructuralEqualityComparer)
    
    # Add entries that map the combination of first name and last name 
    # to each object in $list.
    # Note the regular array syntax.
    $list.ForEach({ $dict.($_.FirstName, $_.LastName) = $_ })
    
    # Use regular array syntax for lookups too.
    # Note: CASE MATTERS
    $dict.('James', 'Cook')
    

    Important: The above performs case-SENSITIVE comparisons (as zett42's tuple solution does), unlike regular PowerShell hashtables.

    Making the comparisons case-INSENSITIVE requires more work, because a custom implementation of the [System.Collections.IEqualityComparer] interface is required, namely a case-insensitive implementation of what [System.Collections.StructuralComparisons]::StructuralEqualityComparer provides:

    # Case-insensitive IEqualityComparer implementation for 
    # use of arrays as dictionary keys.
    # Note: Dictionary keys cannot be $null, so there is no need for $null checks.
    class CaseInsensitiveArrayEqualityComparer: System.Collections.IEqualityComparer {
       [bool] Equals([object] $o1, [object] $o2) {
          return ([System.Collections.IStructuralEquatable] [array] $o1).Equals([array] $o2, [System.StringComparer]::InvariantCultureIgnoreCase)
       }
       [int] GetHashCode([object] $o) {
         return ([System.Collections.IStructuralEquatable] [array] $o).GetHashCode([StringComparer]::InvariantCultureIgnoreCase)
       }
    }
    
    # Pass an instance of the custom equality comparer to the constructor.
    $dict = [hashtable]::new([CaseInsensitiveArrayEqualityComparer]::new())
    

    Note:

    • Santiago Squarzon discovered ([System.Collections.IStructuralEquatable] $o).GetHashCode([StringComparer]::InvariantCultureIgnoreCase) as a built-in way to get a hash code for an array based on its elements' case-insensitive hash codes.

    • The original solutions below calculate the array's case-insensitive hash code element by element, which is both more cumbersome and less efficient. Perhaps they are still of interest in general with respect to how hash codes are calculated.


    Optional reading: element-by-element hash-code implementations:

    # Case-insensitive IEqualityComparer implementation for arrays.
    # See the bottom section of this answer for a better .NET 7+ alternative.
    class CaseInsensitiveArrayEqualityComparer: System.Collections.IEqualityComparer {
      [bool] Equals([object] $o1, [object] $o2) {
        return ([System.Collections.IStructuralEquatable] [array] $o1).Equals([array] $o2, [System.StringComparer]::InvariantCultureIgnoreCase)
      }
      [int] GetHashCode([object] $o) {
        if ($o -isnot [Array]) { return $o.GetHashCode() }
        [int] $hashCode = 0
        foreach ($el in $o) {
          if ($null -eq $el) { 
            continue
          } elseif ($el -is [string]) {
            $hashCode = $hashCode -bxor $el.ToLowerInvariant().GetHashCode()
          } else {
            $hashCode = $hashCode -bxor $el.GetHashCode()
          }
        }
        return $hashCode
      }
    }
    
    $list =  ConvertFrom-Csv @'
    Id, LastName, FirstName, Country
     1, Aerts,    Ronald,    Belgium
     2, Berg,     Ashly,     Germany
     3, Cook,     James,     England
     4, Duval,    Frank,     France
     5, Lyberg,   Ash,       England
     6, Fischer,  Adam,      Germany
    '@
    
    # Pass an instance of the custom equality comparer to the constructor.
    $dict = [hashtable]::new([CaseInsensitiveArrayEqualityComparer]::new())
    
    $list.ForEach({ $dict.($_.FirstName, $_.LastName) = $_ })
    
    # Now, case does NOT matter.
    $dict.('james', 'cook')
    

    A note on the .GetHashCode() implementation in the custom comparer class above:

    • A custom .GetHashCode() implementation is required to return the same hash code (an [int] value) for all objects that compare as equal (that is, if $o1 -eq $o2 is $true, $o1.GetHashCode() and $o2.GetHashCode() must return the same value).

    • While hash codes aren't required to be unique (and cannot be in all cases), ideally as few objects as possible share the same hash code, as that reduces the number of so-called collisions, which decreases the lookup efficiency of hash tables - see the relevant Wikipedia article for background information.

    • The implementation above uses a fairly simple, -bxor-based (bitwise XOR) algorithm, which results in the same hash code for two arrays that have the same elements, but in different order.

      • The .GetHashCode() help topic shows more sophisticated approaches, including using an auxiliary tuple instance, as its hash code algorithm is order-aware - while simple, this approach is computationally expensive, and more work is needed for better performance. See the bottom section for a .NET 7+ option.

    zett42's collision test code (adapted), which determines how many among 1000 arrays with a given number of elements that are random string values result in the same hash code, i.e. produce collisions, and calculates a collision percentage from that. If you need to improve the efficiency of the implementation above you can use this code to test it (possibly also to measure the tests' runtime to see how different implementations compare).

    # Create an instance of the custom comparer defined above.
    $cmp = [CaseInsensitiveArrayEqualityComparer]::new()
    
    $numArrays = 1000
    foreach ($elementCount in 2..5 + 10) {
    
      $numUniqueHashes = (
          1..$numArrays | 
          ForEach-Object { 
            $cmp.GetHashCode(@(1..$elementCount | ForEach-Object { "$(New-Guid)" })) 
          } |
          Sort-Object -Unique
        ).Count
    
      [pscustomobject] @{
        ElementCount = $elementCount
        CollisionPercentage = '{0:P2}' -f (($numArrays - $numUniqueHashes) / $numArrays)
      }
    
    }
    

    The about outputs 0% for all tests, so it seems that the -bxor approach is sufficient to prevent collisions, at least with random strings and without including variations of arrays that differ in element order only. Read on for a superior .NET 7+ solution.


    Superior custom equality comparer implementation in .NET 7+ (requires at least a preview version of PowerShell 7.3):

    zett42 points out that [HashCode]::Combine(), available in .NET 7+, allows for a more efficient implementation, as it:

    • is order-aware
    • allows determining a hash code for multiple values in a single operation.

    Note:

    • The method is limited to at most 8 array elements - but for multi-component that should be enough.

    • The values to combine - the array elements in this case - must be passed as individual arguments to the methods - passing the array as a whole doesn't work as intended. This makes the implementation somewhat cumbersome.

    # .NET 7+ / PowerShell 7.3+
    # Case-insensitive IEqualityComparer implementation for arrays
    # using [HashCode]::Combine() - limited to 8 elements.
    class CaseInsensitiveArrayEqualityComparer: System.Collections.IEqualityComparer {
      [bool] Equals([object] $o1, [object] $o2) {
        return ([System.Collections.IStructuralEquatable] [array] $o1).Equals([array] $o2, [System.StringComparer]::InvariantCultureIgnoreCase)
      }
      [int] GetHashCode([object] $o) {
        if ($o -isnot [Array] -or 0 -eq $o.Count) { return $o.GetHashCode() }
        $o = $o.ForEach({ $_ -is [string] ? $_.ToLowerInvariant() : $_ })
        $hashCode = switch ($o.Count) {
          1 { [HashCode]::Combine($o[0]) }
          2 { [HashCode]::Combine($o[0], $o[1]) }
          3 { [HashCode]::Combine($o[0], $o[1], $o[2]) }
          4 { [HashCode]::Combine($o[0], $o[1], $o[2], $o[3]) }
          5 { [HashCode]::Combine($o[0], $o[1], $o[2], $o[3], $o[4]) }
          6 { [HashCode]::Combine($o[0], $o[1], $o[2], $o[3], $o[4], $o[5]) }
          7 { [HashCode]::Combine($o[0], $o[1], $o[2], $o[3], $o[4], $o[5], $o[6]) }
          8 { [HashCode]::Combine($o[0], $o[1], $o[2], $o[3], $o[4], $o[5], $o[6], $o[7]) }
          default { throw 'Not implemented for more than 8 array elements.' }
        }
        return $hashCode
      }
    }
    

    However, as zett42 points out, you can overcame the value-count limit by calling [HashCode]::Combine() iteratively, in a loop.

    In the case of a case-insensitive implementation, that isn't too much overhead, given that you need a loop anyway, namely in order to call .ToLowerInvariant() on [string]-typed values (which is what the .ForEach() call above implicitly does).

    Here is his implementation:

    # .NET 7+ / PowerShell 7.3+
    # Case-insensitive IEqualityComparer implementation for arrays
    # using [HashCode]::Combine() *iteratively*, with *no* element-count limit.
    class CaseInsensitiveArrayEqualityComparer: System.Collections.IEqualityComparer {
      [bool] Equals([object] $o1, [object] $o2) {
        return ([System.Collections.IStructuralEquatable] [array] $o1).Equals([array] $o2, [System.StringComparer]::InvariantCultureIgnoreCase)
      }
      [int] GetHashCode([object] $o) {
        if ($o -isnot [Array] -or 0 -eq $o.Count) { return $o.GetHashCode() }
        
        $hashCode = 0
        $o.ForEach({ 
            $value = $_ -is [string] ? $_.ToLowerInvariant() : $_
            $hashCode = [HashCode]::Combine( $hashCode, $value )
        })
    
        return $hashCode
      }
    }