what is a low-latency hardware algorithm to find the maximum value in an array

I need to build a low-latency, single-cycle hardware module to find the index and value of the maximum element in an array. Currently, I am using a comparator tree, but the latency is unsatisfactory. So are there any other algorithms that might have lower latency?

I expect the input array to be large (256 to 4096 elements) and the values to be small (3 to 5 bits). Also, I expect the array to be sparse i.e. many small values and few large values.

I am primarily concerned with latency; area is not quite as important.

My current implementation that uses a comparator tree looks something like this:

implicit class reduceTreeOp[A](seq: Seq[A]) {
  def reduceTree[B >: A](op: (B, B) => B): B = {
    if(seq.length == 0)
      throw new NoSuchElementException("cannot reduce empty Seq")
    var rseq: Seq[B] = seq
    while(rseq.length != 1)
      rseq = rseq.grouped(2).toSeq
        .map(s => if(s.length == 1) s(0) else op(s(0), s(1)))
    rseq(0)
  }
}
val (value, index) = array
  .zipWithIndex
  .map{case (v, i) => (v, i.U)}
  .reduceTree[(UInt, UInt)]{case ((val1, idx1), (val2, idx2)) =>
    val is1 = val1 >= idx2
    ( Mux(is1, val1, idx2),
      Mux(is1, idx1, idx2))
  }

FWIW this is designed for 7nm hardware; though I doubt this actually matters to my question.

Solution

I found an algorithm that works by inverting the array. Essentially, a new array is created with a slot for each possible value, each element of the original array is sorted into a slot depending on the value, and finally, the new array can be priority encoded to find the highest slot with an element.

The implementation looks something like the following:

// VALUE_SPACE: number of possible values
val exists = WireDefault(VecInit(Seq.fill(VALUE_SPACE)(false.B)))
val index = Wire(Vec(VALUE_SPACE, UInt(log2ceil(array.length).W)))
index := DontCare

array.zipWithIndex.foreach{case (v, i) =>
  exists(v) := true.B
  index(v) := i.U
}

maxVal := VALUE_SPACE.U - PriorityEncoder(exists.reverse)
maxidx := PriorityMux(exists.reverse, index.reverse)

This algorithm gave me approximately 2.5x speedup over the comparator tree when using an array with 256 4-bit elements. However, it used much more area, and it probably only has benefit if the number of possible values is very small.