I want to start by saying that my code works as intended and is reasonably fast. However, profiling it shows that most of the time is spent on one very specific portion, which leads me to ask: is there any generally accepted better solution for this?
Here is my implementation:
var cellDistance = new double[cells.Count];
cellDistance.SetAll(idx => idx == startCellIndex ? 0 : double.PositiveInfinity);
var visitedCells = new HashSet<int>();
do
{
    // current cell is the smallest unvisited tentative distance cell
    var currentCell = cells[cellDistance.Select((d, idx) => (d, idx)).OrderBy(x => x.d).First(x => !visitedCells.Contains(cells[x.idx].Index)).idx];
    foreach (var neighbourCell in currentCell.Neighbours)
        if (!visitedCells.Contains(neighbourCell.Index))
        {
            var distanceThroughCurrentCell = cellDistance[currentCell.Index] + neighbourCell.Value;
            if (cellDistance[neighbourCell.Index] > distanceThroughCurrentCell)
            {
                cellDistance[neighbourCell.Index] = distanceThroughCurrentCell;
                prevCell[neighbourCell] = currentCell;
            }
        }
    visitedCells.Add(currentCell.Index);
} while (visitedCells.Count != cells.Count && !visitedCells.Contains(endCell.Index));
Most of the time is spent on this line, which takes the unvisited node with the lowest partial cost:
var currentCell = cells[cellDistance.Select((d, idx) => (d, idx)).OrderBy(x => x.d).First(x => !visitedCells.Contains(cells[x.idx].Index)).idx];
And more specifically, in the last lambda, not the sort (which I found very surprising):
x => !visitedCells.Contains(cells[x.idx].Index)
Since visitedCells is already a HashSet, there isn't much I can improve with just the built-in data structures, so my question is: is there a different way of storing the partial costs that makes this specific query (i.e., finding the unvisited node with the lowest partial cost) noticeably faster?
I was considering some kind of sorted dictionary, but I'd need one that sorts by value, because if it's sorted by key I'd have to make the partial cost the key, which makes updating it costly and then poses the problem of how to map this structure to my cost array, and it still doesn't solve my visitedCells lookup.
Using an array of flags instead of HashSet
A HashSet has amortized O(1) insertion and expected O(1) lookup. However, your node ids are simply indices into an array: they are sequential, they don't grow much, and eventually every id ends up in the set. In this case you have faster O(1) options than a generic hash table: an array of booleans, indexed by node id, recording whether each node has been visited.
Simply allocate a boolean array with size equal to the node count, fill it with false, and set the entry at a node's id to true when you visit that node.
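A minimal sketch of that idea, reusing the cells, currentCell and neighbourCell names from your question; only the visited bookkeeping changes, the rest of the loop stays the same:

var visited = new bool[cells.Count];   // every entry starts out false

// inside the loop, instead of visitedCells.Contains(neighbourCell.Index):
if (!visited[neighbourCell.Index])
{
    // relax the edge exactly as before
}

// and instead of visitedCells.Add(currentCell.Index):
visited[currentCell.Index] = true;

The loop condition that used visitedCells.Count can keep a separate counter of visited nodes, or simply stop once visited[endCell.Index] is true.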
Iterating over all nodes instead of sorting them for selecting the next node
Your current code has to sort all nodes with respect to their distances, and then go through them one by one to find the first non-visited one. This takes θ(n log n) time in most cases due to the sorting. (An optimization could be made to partially sort the nodes, but it would be very surprising if a compiler/library could see that opportunity by itself.) Your total time complexity becomes θ(n^2 log n) with this approach. Instead, you can go through the nodes once, keeping track of the minimum-distance non-visited node seen so far. This works in θ(n). Total time complexity is O(n^2), as Dijkstra should be.
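For example, a sketch (not your exact code) assuming the cellDistance array from your question and the visited flag array from the previous section:

static int FindClosestUnvisited(double[] cellDistance, bool[] visited)
{
    var best = -1;
    var bestDistance = double.PositiveInfinity;
    for (var i = 0; i < cellDistance.Length; i++)
    {
        // keep the smallest tentative distance among unvisited nodes
        if (!visited[i] && cellDistance[i] <= bestDistance)
        {
            best = i;
            bestDistance = cellDistance[i];
        }
    }
    return best;   // -1 only if every node has already been visited
}

The LINQ chain then becomes var currentCell = cells[FindClosestUnvisited(cellDistance, visited)];, which also removes the allocations made by Select/OrderBy on every iteration.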
With these two changes, your code will not have much left that is unneeded for Dijkstra's shortest path.
I was considering some kind of sorted dictionary, but I'd need one that sorts by value, because if it's sorted by key I'd have to make the partial cost the key, which makes updating it costly and then poses the problem of how to map this structure to my cost array
There is a data structure called a min-heap that can be used to extract the minimum value from a set (along with its satellite data). A simple binary min-heap can extract the minimum key, or decrease a key it holds, in θ(log n) worst-case time.
In the case of Dijkstra, you need a sparse graph for this to be more efficient than iterating over all distances (sparse graph ≈ the number of edges is much smaller than the number of nodes squared), because the algorithm may need to decrease a distance each time it relaxes an edge.
If there are θ(n^2) edges, this makes the worst-case total time complexity θ(n^2 log n).
If there are θ(n^2 / log n) edges, the time spent in relaxations becomes O(n^2). So you need a graph even sparser than that for a binary heap to beat a simple array.
In the worst case, extracting all minimum-distance nodes from the heap takes θ(n log n) time and relaxing all edges takes θ(e log n) time, where e is the edge count, for a total of θ((n + e) log n). As I said, this can be more efficient than θ(n^2) only if e is asymptotically smaller than n^2 / log n.
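If your graph is sparse enough to benefit, .NET 6+ already ships a binary heap as PriorityQueue<TElement, TPriority>. It has no decrease-key operation, so the usual workaround is to re-enqueue a node whenever its distance improves and to skip stale entries on dequeue. A hedged sketch under that assumption; the adjacency representation (neighbours[u] as a list of (to, weight) pairs) is hypothetical and not the cell structure from your question:

using System;
using System.Collections.Generic;

static double[] Dijkstra(List<(int to, double weight)>[] neighbours, int start)
{
    var n = neighbours.Length;
    var distance = new double[n];
    Array.Fill(distance, double.PositiveInfinity);
    distance[start] = 0;

    var queue = new PriorityQueue<int, double>();
    queue.Enqueue(start, 0);

    while (queue.TryDequeue(out var u, out var d))
    {
        if (d > distance[u])
            continue;                        // stale entry: a shorter path was already found

        foreach (var (to, weight) in neighbours[u])
        {
            var throughU = distance[u] + weight;
            if (throughU < distance[to])
            {
                distance[to] = throughU;
                queue.Enqueue(to, throughU); // re-enqueue instead of decrease-key
            }
        }
    }
    return distance;
}

With lazy deletion the queue can hold duplicate entries (up to roughly one per edge), which changes the bound to O((n + e) log e); since log e ≤ 2 log n for a simple graph, this is still O((n + e) log n).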