I am trying to work out which order gives the best performance when I use both `OrderBy()` and `Distinct()` inside a LINQ query. It seems to me they should be equally fast, since `Distinct()` uses a hash table when working in memory, and I assume that any SQL query would be optimized by .NET before it gets executed.
Am I correct in assuming this, or does the order of these two calls still affect the performance of LINQ in general?
As for how I think it works: when you build a LINQ query, you're basically building an expression tree, but nothing gets executed yet. So calling `MyList.Distinct().OrderBy()` would just build this tree without executing it (it's deferred). Only when you call another method such as `ToList()` does the expression tree get executed, and the runtime could optimize the tree before that happens.
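The deferred execution described above can be observed directly. The following is a minimal sketch, assuming an in-memory `List<int>`; the class name `DeferredDemo` and the call counter are made up for illustration. Note that for in-memory collections LINQ to Objects chains iterators built from delegates rather than an expression tree (expression trees come into play with `IQueryable` providers), so the operators run in exactly the order they were written.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Illustrative counter: incremented each time the OrderBy key selector runs.
    public static int KeySelectorCalls;

    // Builds the query only; nothing is enumerated here.
    public static IEnumerable<int> BuildQuery(List<int> source) =>
        source.Distinct().OrderBy(i => { KeySelectorCalls++; return i; });

    public static void Main()
    {
        var data = new List<int> { 3, 1, 3, 2, 1 };

        var query = BuildQuery(data);
        Console.WriteLine(KeySelectorCalls);          // 0: the query has not run yet

        var result = query.ToList();                  // enumeration happens here
        Console.WriteLine(KeySelectorCalls);          // 3: one call per distinct element
        Console.WriteLine(string.Join(", ", result)); // 1, 2, 3
    }
}
```

Because `Distinct()` runs before `OrderBy()` here, the key selector only sees the three distinct values, not all five input elements.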
For LINQ to Objects, even if we assume that `OrderBy(...).Distinct()` and `Distinct().OrderBy(...)` return the same result (which is not guaranteed), the performance will depend on the data.
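To illustrate why the same result is not guaranteed, here is a small sketch in which the two orders give different output. It assumes a case-insensitive comparer for `Distinct` and an ordinal comparer for `OrderBy`; the sample words and the class name `OrderSensitivity` are made up for illustration. It also relies on `Distinct` keeping the first occurrence of each element, which is what current implementations do, although as far as I know the documentation describes the result of `Distinct` as an unordered sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderSensitivity
{
    // Hypothetical sample data: two pairs of words that differ only by case.
    public static readonly string[] Words = { "b", "B", "a", "A" };

    // Deduplicate first (keeps "b" and "a"), then sort the survivors.
    public static List<string> DistinctThenOrderBy() =>
        Words.Distinct(StringComparer.OrdinalIgnoreCase)
             .OrderBy(w => w, StringComparer.Ordinal)
             .ToList();

    // Sort first (upper case sorts before lower case ordinally), then deduplicate.
    public static List<string> OrderByThenDistinct() =>
        Words.OrderBy(w => w, StringComparer.Ordinal)
             .Distinct(StringComparer.OrdinalIgnoreCase)
             .ToList();

    public static void Main()
    {
        // On current .NET implementations:
        Console.WriteLine(string.Join(", ", DistinctThenOrderBy())); // a, b
        Console.WriteLine(string.Join(", ", OrderByThenDistinct())); // A, B
    }
}
```

For plain `int` values with the default comparers, as in a simple benchmark, both orders happen to produce the same list in practice, but that is a property of the data and the current implementation rather than something to rely on in general.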
If the data contains a lot of duplicates, running `Distinct` first should be faster, because the sort then has fewer elements to process. The following benchmark shows that (at least on my machine):
```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class LinqBench
{
    // 1,000 items: the values 1..100, each appearing 10 times, interleaved.
    private static List<int> test = Enumerable.Range(1, 100)
        .SelectMany(i => Enumerable.Repeat(i, 10))
        .Select((i, index) => (i, index))
        .OrderBy(t => t.index % 10)
        .Select(t => t.i)
        .ToList();

    [Benchmark]
    public List<int> OrderByThenDistinct() => test.OrderBy(i => i).Distinct().ToList();

    [Benchmark]
    public List<int> DistinctThenOrderBy() => test.Distinct().OrderBy(i => i).ToList();
}
```
On my machine, with .NET Core 3.1, it gives:
Method | Mean | Error | StdDev |
---|---|---|---|
OrderByThenDistinct | 129.74 us | 2.120 us | 1.879 us |
DistinctThenOrderBy | 19.58 us | 0.384 us | 0.794 us |
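
One way to see where the difference comes from is to count how many comparisons the sort performs in each order. The sketch below is only an illustration, not a replacement for a proper benchmark: `CountingComparer` and `ComparisonCount` are hypothetical helpers, and the input is assumed to have the same shape as the data used above (the values 1..100 repeated 10 times).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative helper (not part of LINQ): counts how often the sort compares two keys.
public sealed class CountingComparer : IComparer<int>
{
    public int Count { get; private set; }

    public int Compare(int x, int y)
    {
        Count++;
        return x.CompareTo(y);
    }
}

public static class ComparisonCount
{
    // 1,000 items: the values 1..100, repeated 10 times.
    public static List<int> BuildData() =>
        Enumerable.Range(0, 1000).Select(i => i % 100 + 1).ToList();

    // Sorts all 1,000 items, then removes duplicates.
    public static int OrderByThenDistinct(List<int> data)
    {
        var comparer = new CountingComparer();
        _ = data.OrderBy(i => i, comparer).Distinct().ToList();
        return comparer.Count;
    }

    // Removes duplicates first, so only 100 items reach the sort.
    public static int DistinctThenOrderBy(List<int> data)
    {
        var comparer = new CountingComparer();
        _ = data.Distinct().OrderBy(i => i, comparer).ToList();
        return comparer.Count;
    }

    public static void Main()
    {
        var data = BuildData();
        Console.WriteLine($"OrderBy then Distinct: {OrderByThenDistinct(data)} comparisons");
        Console.WriteLine($"Distinct then OrderBy: {DistinctThenOrderBy(data)} comparisons");
    }
}
```

The exact counts depend on the runtime's sort implementation, but sorting ten times as many elements should need many more comparisons, while the hash-based `Distinct` pass costs roughly the same in both orders.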