I need to build a low-latency, single-cycle hardware module to find the index and value of the maximum element in an array. Currently, I am using a comparator tree, but the latency is unsatisfactory. So are there any other algorithms that might have lower latency?
I expect the input array to be large (256 to 4096 elements) and the values to be small (3 to 5 bits). Also, I expect the array to be sparse i.e. many small values and few large values.
I am primarily concerned with latency; area is not quite as important.
My current implementation that uses a comparator tree looks something like this:
implicit class reduceTreeOp[A](seq: Seq[A]) {
def reduceTree[B >: A](op: (B, B) => B): B = {
if(seq.length == 0)
throw new NoSuchElementException("cannot reduce empty Seq")
var rseq: Seq[B] = seq
while(rseq.length != 1)
rseq = rseq.grouped(2).toSeq
.map(s => if(s.length == 1) s(0) else op(s(0), s(1)))
rseq(0)
}
}
val (value, index) = array
.zipWithIndex
.map{case (v, i) => (v, i.U)}
.reduceTree[(UInt, UInt)]{case ((val1, idx1), (val2, idx2)) =>
val is1 = val1 >= idx2
( Mux(is1, val1, idx2),
Mux(is1, idx1, idx2))
}
FWIW this is designed for 7nm hardware; though I doubt this actually matters to my question.
I found an algorithm that works by inverting the array. Essentially, a new array is created with a slot for each possible value, each element of the original array is sorted into a slot depending on the value, and finally, the new array can be priority encoded to find the highest slot with an element.
The implementation looks something like the following:
// VALUE_SPACE: number of possible values
val exists = WireDefault(VecInit(Seq.fill(VALUE_SPACE)(false.B)))
val index = Wire(Vec(VALUE_SPACE, UInt(log2ceil(array.length).W)))
index := DontCare
array.zipWithIndex.foreach{case (v, i) =>
exists(v) := true.B
index(v) := i.U
}
maxVal := VALUE_SPACE.U - PriorityEncoder(exists.reverse)
maxidx := PriorityMux(exists.reverse, index.reverse)
This algorithm gave me approximately 2.5x speedup over the comparator tree when using an array with 256 4-bit elements. However, it used much more area, and it probably only has benefit if the number of possible values is very small.