PowerShell Select-Object: Using -Unique with First/Last/Skip/Index

I'm just curious if I'm missing any documentation, or if there is a different/better way to do this that negates the need for documentation. Maybe I'm the only one trying to use Select-Object to select the -First X unique instances from a set of data.

Based on the testing below, it looks like using Select-Object with the -Unique switch and some type of limiter (First, Last, Skip, Index, etc.) inherently causes the limiter to be applied BEFORE removing duplicates. This doesn't make sense to me conceptually, but also doesn't appear to be documented.

I apologize for the poor example, but consider an array of 20 items with each item appearing twice:

PS > $array = @() ; 1..10 | % { $array += $_ ; $array += $_ }
PS > $array -Join ','
1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10  ##Displaying the array on a single comma separated line

Let's say that someone gives you $array, but you can only handle a maximum input of 5 objects. To filter down what you're given, you might be tempted to use Select-Object. At first you end up with 5 objects, but there are duplicates, so you add the -Unique switch, only to realize that the output still isn't quite right.

PS > ($array | Select-Object -First 5) -Join ','
1,1,2,2,3  ##5 objects as expected, but with duplicates
PS > ($array | Select-Object -Unique -First 5) -Join ','
1,2,3  ##No duplicates, but less than the expected 5 objects...

To get the outcome I was expecting, I'd need Select-Object to remove the duplicates before returning the final set of objects. While there is nothing wrong with knowing this, it seems strange to me that Select-Object uses the order of operations that it does, and also that there isn't any documentation of the fact that the -Unique switch is applied at the end of the cmdlet.

PS > ($array | Select-Object -Unique | Select-Object -First 5) -Join ','
1,2,3,4,5  ##This is my expected outcome, 5 objects returned without any duplicates

Solution

Indeed, the -First / -Last / -Skip / -Index / -SkipIndex / -SkipLast parameters apply to the original input first, and -Unique is applied to the resulting output.

The simple workaround is to use two Select-Object calls: one that finds the unique objects, and another that selects the desired number from among the unique ones:

PS> 1, 1, 2, 3 | Select-Object -Unique | Select-Object -First 2
1
2

Given that Select-Object -Unique is excessively slow as of PowerShell 7.2 (see bottom section), here is a faster workaround, as you've discovered yourself: Use an aux. System.Collections.Generic.HashSet`1 instance combined with ForEach-Object; the example also shows support for case-insensitivity, which Select-Object -Unique currently lacks (see bottom section):

# Create an aux. hash set that keeps track of what objects have
# already been seen, using case-*insensitive* comparisons.
$auxHashSet = [Collections.Generic.HashSet[string]]::new(
                [StringComparer]::InvariantCultureIgnoreCase
              )

# Stream to ForEach-Object, where the aux. hash set is used
# to only pass out objects that haven't previously been seen.
'a', 'A', 'B', 'c' |
  ForEach-Object { if ($auxHashSet.Add($_)) { $_ } } |
    Select-Object -First 2

This outputs 'a', 'B', as desired. Note that you may want to remove the $auxHashSet variable so as to (eventually) free its memory - see next.

Using a -Begin block with ForEach-Object, you can make the pipeline more self-contained. Note, however, that all script blocks run directly in the caller's scope, so $auxHashSet is still created there and lives on after the command; you'll still have to remove it manually to (eventually) release its memory.

Note: While in principle you could do that in an -End block, this does not work with Select-Object -First, because the premature stopping of the pipeline does not give upstream cmdlets a chance to run their end blocks - see GitHub issue #7930 for a discussion of this surprising behavior.

'a', 'A', 'B', 'c' |
  ForEach-Object -Begin { 
    $auxHashSet = [Collections.Generic.HashSet[string]]::new([StringComparer]::InvariantCultureIgnoreCase) 
  } -Process {
    if ($auxHashSet.Add($_)) { $_ } 
  } |
    Select-Object -First 2
# Remove the aux. variable and (eventually) free its memory.
Remove-Variable auxHashSet
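
If you'd rather not clean up manually, one option is to move the hash set into a function, whose variables are local to the function's own scope. The following is only a minimal sketch: Select-UniqueString is a hypothetical helper name (not a built-in cmdlet), and it assumes string input and case-insensitive comparison, as in the examples above.

```powershell
# Hypothetical helper (not a built-in cmdlet): passes through only the first
# occurrence of each input string, compared case-insensitively.
function Select-UniqueString {
  [CmdletBinding()]
  param(
    [Parameter(ValueFromPipeline)]
    [string] $InputObject
  )
  begin {
    # Local to the function's scope, so no variable is left behind in the caller.
    $seen = [Collections.Generic.HashSet[string]]::new(
      [StringComparer]::InvariantCultureIgnoreCase
    )
  }
  process {
    # Add() returns $true only if the value wasn't already in the set.
    if ($seen.Add($InputObject)) { $InputObject }
  }
}

# Streams unique strings; Select-Object -First 2 stops the pipeline early.
'a', 'A', 'B', 'c' | Select-UniqueString | Select-Object -First 2
```

Because the function has a process block, it emits each new string as soon as it arrives, so the downstream Select-Object -First can still stop the pipeline once it has enough objects.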

Note that there's also a LINQ-based alternative, via [System.Linq.Enumerable]::Distinct(), but it has important constraints:

- The output is unordered, i.e. the input order is not guaranteed to be preserved.
- You cannot stream the method's input collection from a PowerShell command (to pass a PowerShell command's output to a method, it must be collected in full in an array, up front) - however, the output from LINQ methods such as Distinct() is effectively streaming, due to returning a lazy enumerable.^[1]
- Additionally, the input array must be strongly typed, if it isn't already. PowerShell makes this easy with a cast such as [int[]], but do note that with an [object[]]-based array as input (which is what regular PowerShell arrays are, such as those used for collecting command output), the cast involves creating a copy of the array, which with large input collections can by itself take a while.

[Linq.Enumerable]::Distinct(
  [string[]] ('a', 'A', 'B', 'c'), 
  [StringComparer]::InvariantCultureIgnoreCase
) | Select-Object -First 2

This too outputs 'a', 'B' (though the order of the output elements isn't guaranteed).

If the constraints aren't a concern and you need to find the unique elements in the whole input collection (or a large part of it), this solution is considerably faster than the hash-set-assisted ForEach-Object solution, especially if your input collection is already strongly typed.

If, within the same constraints, you don't care about the lazy output behavior and just want to get an in-memory collection of all distinct objects - again, unordered - you can use a System.Collections.Generic.HashSet`1 instance directly:

[Collections.Generic.HashSet[string]]::new(
  [string[]] ('a', 'A', 'B', 'c'), 
  [System.StringComparer]::InvariantCultureIgnoreCase
)

This outputs 'a', 'B', 'c', though notably as a hash-set object rather than an array. Because it is enumerable, it behaves like an array in PowerShell's enumeration contexts, notably in the pipeline.

Select-Object -Unique pitfalls, contrast with Sort-Object:

While the extra Select-Object call does add processing overhead, the command overall has the potential to process only as many input objects as needed, i.e. to stop processing once the desired number of unique objects has been found.
However, as of PowerShell 7.2, it seems that Select-Object -Unique is implemented inefficiently and unexpectedly collects all input first before producing output, even though there's no conceptual reason to do so: it should be able to produce streaming output, i.e. to - conditionally - output input objects as they're being received, because it only needs to consider what input objects have been received so far.
- In practice, as of PowerShell 7.2, Select-Object -Unique is excessively slow with larger input collections; the current, problematic implementation is discussed in GitHub issues #11221 and #7707.
- This conceptual ability to only consider input received so far contrasts with Sort-Object, which also offers a -Unique switch, but of necessity must collect all input first before producing output, because all input objects must be considered for proper sorting.
  - As of PowerShell 7.2, Sort-Object -Unique is much faster in practice than Select-Object -Unique.
- As for how Select-Object -Unique could be implemented in a more efficient, streaming manner: The objects seen so far could be stored in a System.Collections.Generic.HashSet`1 instance to facilitate an efficient test for whether an input object is considered equal to one that has already been output; see this answer for a PowerShell example.
If and when Select-Object -Unique is fixed, the tradeoff is as follows:
- The smaller the proportion of output objects of interest is in relation to all input objects, the better off you are using Select-Object -Unique (even if you have to sort the resulting objects afterwards).
- If you need to output / consider all input objects anyway, and assuming that outputting the objects of interest in sort order is desired / acceptable, Sort-Object is the better choice.
As of PowerShell 7.2, Select-Object -Unique is unexpectedly case-sensitive for string input, even though PowerShell is normally case-insensitive by default - see GitHub issue #12059.
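
The difference in case handling between the two cmdlets can be checked with a quick comparison. This is a small sketch; the variable names are just for illustration, and the counts you see depend on the PowerShell version, since the behavior described above may change if the linked issue is addressed.

```powershell
# Select-Object -Unique compares strings case-sensitively (see above),
# so 'a' and 'A' are treated as two distinct values.
$viaSelect = @('a', 'A' | Select-Object -Unique)

# Sort-Object -Unique follows PowerShell's usual case-insensitive default,
# so 'a' and 'A' are treated as duplicates of each other.
$viaSort = @('a', 'A' | Sort-Object -Unique)

"Select-Object -Unique returned $($viaSelect.Count) object(s)"
"Sort-Object -Unique returned $($viaSort.Count) object(s)"
```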

Testing whether a cmdlet produces streaming output or collects all input first:

Short of examining a cmdlet's source code, here's a way to test - the middle pipeline segment is the command to test:

# Test Sort-Object -Unique
# Because the command cannot stream, for conceptual reasons, 
# it takes a while for the one and only output object to appear.
1..1e5 | Sort-Object -Unique | Select-Object -First 1

# Test Select-Object -Unique
# The command *could* stream, conceptually speaking, in which case
# the output object would appear right away.
# However, as of PowerShell 7.2, the command isn't implemented
# in a streaming fashion, so it takes a - surprisingly long - while
# for the output object to appear.
1..1e5 | Select-Object -Unique | Select-Object -First 1

If the given pipeline above produces its one and only output object near instantly, the command of interest is streaming; if it takes a while before the output object appears, it collects all input first.
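
If "a while" is hard to judge by eye, you can make the difference measurable by feeding the command a deliberately slow input and timing it with Measure-Command. The following is only a sketch: Get-SlowInput is a hypothetical helper, and the 50 objects and 20 ms delay are arbitrary values chosen so that the test finishes in about a second.

```powershell
# Hypothetical helper: a slow producer that emits 50 numbers,
# pausing 20 ms before each one.
function Get-SlowInput {
  1..50 | ForEach-Object { Start-Sleep -Milliseconds 20; $_ }
}

# A streaming middle segment: the first object arrives after roughly one pause,
# and Select-Object -First 1 then stops the producer.
$streaming = Measure-Command {
  Get-SlowInput | ForEach-Object { $_ } | Select-Object -First 1
}

# A collecting middle segment: nothing arrives until all 50 pauses have elapsed.
$collecting = Measure-Command {
  Get-SlowInput | Sort-Object -Unique | Select-Object -First 1
}

'{0,-12} {1,8:N0} ms' -f 'streaming:', $streaming.TotalMilliseconds
'{0,-12} {1,8:N0} ms' -f 'collecting:', $collecting.TotalMilliseconds
```

To test another command, substitute it for the middle pipeline segment. A time close to the streaming figure suggests that it streams, whereas a time close to the total delay of the producer suggests that it collects all input first.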
