algorithm, cluster-analysis, hierarchical-clustering

Algorithmic complexity of group average clustering


I've been reading lately about various hierarchical clustering algorithms such as single-linkage clustering and group average clustering. In general, these algorithms don't tend to scale well. Naive implementations of most hierarchical clustering algorithms are O(N^3), but single-linkage clustering can be implemented in O(N^2) time.

It is also claimed that group-average clustering can be implemented in O(N^2 log N) time. This is what my question is about.

I simply do not see how this is possible.

Explanation after explanation, such as:

http://nlp.stanford.edu/IR-book/html/htmledition/time-complexity-of-hac-1.html

http://nlp.stanford.edu/IR-book/completelink.html#averagesection

https://en.wikipedia.org/wiki/UPGMA#Time_complexity

... are claiming that group-average hierarchical clustering can be done in O(N^2 log N) time by using priority queues. But when I read the actual explanation or pseudo-code, it always appears to me to be no better than O(N^3).

Essentially, the algorithm is as follows:

For an input sequence of size N:

Create an N x N distance matrix            # O(N^2) time
For each row in the distance matrix:
   Create a priority queue (binary heap) of all distances in the row

Then:

For i in 0 to N-1:
  Find the min element among the tops of all N priority queues  # O(N)
  Let k = the row index of the min element

  Merge cluster k with its nearest neighbor
  For each element e in the kth row:
    Update the corresponding value in the distance matrix
    Update the corresponding value in priority_queue[e]

So it's that last step that, to me, would seem to make this an O(N^3) algorithm. There's no way to "update" an arbitrary value in the priority queue without scanning the queue in O(N) time - assuming the priority queue is a binary heap. (A binary heap gives you constant-time access to the min element and O(log N) insertion/deletion, but you can't find an element by value in better than O(N) time.) And since each merge requires an O(N) scan of a priority queue for each of the N rows, over N merges we get O(N^3).
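To make the objection concrete, here's a small sketch using Python's `heapq` (which stores the heap in a flat list). The helper name `update_naive` is my own for illustration; the point is that changing the priority of an arbitrary item requires locating it first, which is a linear scan:

```python
import heapq

# Each entry is (distance, row_index), as in one row's priority queue.
heap = [(4.2, 'c'), (1.7, 'b'), (3.1, 'a')]
heapq.heapify(heap)

def update_naive(heap, item, new_dist):
    # O(N): scan the underlying list to find the item's position,
    # since a plain binary heap offers no lookup by value.
    i = next(j for j, (_, it) in enumerate(heap) if it == item)
    heap[i] = (new_dist, item)
    heapq.heapify(heap)  # restore the heap invariant

update_naive(heap, 'a', 0.5)
print(heap[0])  # -> (0.5, 'a')
```

Doing this update N times per merge, over N merges, is exactly the O(N^3) count described above.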

Each priority queue is ordered by distance - but the algorithm calls for deleting, from each queue, the element that corresponds to k, the row index of the min element in the distance matrix. Again, there's no way to find that element in the queue without an O(N) scan.

So, I assume I'm probably wrong since everyone else is saying otherwise. Can someone explain how this algorithm is somehow not O(N^3), but in fact O(N^2 log N)?


Solution

  • I think you are saying that the problem is that in order to update an entry in a heap you have to find it first, and finding it takes O(N) time. What you can do to get around this is to maintain an index that records, for each item i, its current location heapPos[i] in the heap. Every time you swap two items to restore the heap invariant, you also update their two entries in heapPos to keep the index correct - but this is just a constant factor on the work done in the heap, so updates and deletions by item become O(log N).
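A minimal sketch of that idea, assuming a dict-based position index (the class and method names `IndexedMinHeap`, `update_key`, `delete` are my own, not from the answer):

```python
class IndexedMinHeap:
    """Binary min-heap that also tracks each item's slot, so that
    update_key / delete by item run in O(log N) instead of O(N)."""

    def __init__(self, keys):
        self.key = dict(keys)          # item -> key (e.g. cluster distance)
        self.heap = list(self.key)     # items, heap-ordered by self.key
        self.pos = {item: i for i, item in enumerate(self.heap)}
        for i in range(len(self.heap) // 2 - 1, -1, -1):
            self._sift_down(i)

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        self.pos[h[i]] = i             # keep the index in sync on every swap
        self.pos[h[j]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.key[self.heap[i]] < self.key[self.heap[parent]]:
                self._swap(i, parent)
                i = parent
            else:
                break

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            child = 2 * i + 1
            if child >= n:
                break
            if child + 1 < n and self.key[self.heap[child + 1]] < self.key[self.heap[child]]:
                child += 1
            if self.key[self.heap[child]] < self.key[self.heap[i]]:
                self._swap(i, child)
                i = child
            else:
                break

    def min_item(self):
        return self.heap[0]

    def update_key(self, item, new_key):
        # O(log N): jump straight to the item's slot, then restore heap order.
        self.key[item] = new_key
        self._sift_up(self.pos[item])
        self._sift_down(self.pos[item])

    def delete(self, item):
        # O(log N) deletion by item, using the same index.
        i = self.pos.pop(item)
        del self.key[item]
        last = self.heap.pop()
        if i < len(self.heap):
            self.heap[i] = last
            self.pos[last] = i
            self._sift_up(i)
            self._sift_down(self.pos[last])

h = IndexedMinHeap({'a': 5.0, 'b': 2.0, 'c': 9.0})
h.update_key('c', 1.0)   # O(log N), no scan needed
print(h.min_item())      # -> 'c'
```

With one such heap per row, each of the N per-merge updates costs O(log N), giving O(N log N) per merge and O(N^2 log N) overall - which is where the claimed bound comes from.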