I have a binary input in (a 1-bit serial input) which I want to delay by M clock pulses and then multiply (AND) the two signals. In other words, I want to evaluate the sum
sum(in[n]*in[n+M])
where n is expressed in terms of the number of clock pulses.
The most straightforward way is to store the latest M samples of in in a memory buffer in_dly. In Verilog, this would be something like:
always @(posedge clock) begin
    // shift register holding the latest M samples of in
    in_dly[M-1:0] <= {in_dly[M-2:0], in};
    // in_dly[M-1] is in delayed by M cycles: accumulate the product
    if (in_dly[M-1] & in)
        sum <= sum + 'd1;
end
While this works in theory, with large values of M (which can be ~2000) the size of the buffer is not practical. However, I was thinking of taking advantage of the fact that the input signal is only 1 bit and is expected to toggle only a few times (~1-10) during M samples.
This made me think of storing the toggle times from 2k*M to (2k+1)*M in an array a, and those from (2k+1)*M to (2k+2)*M in an array b (k is just an integer used to generalize the idea):
reg [10:0] a[0:9]; //2^11 > max(M)=2000 and "a" has max 10 elements
reg [10:0] b[0:9]; //same as "a"
Therefore, during one window of M samples, in = 'b1 during the intervals [a[1],a[2]], [a[3],a[4]], etc. Similarly, during the next M samples, the input is high during [b[1],b[2]], [b[3],b[4]], etc. Now, the sum is the overlap of these intervals:
min(b[2],a[2]) - max(b[1],a[1]), if b[2] > a[1] and b[1] < a[2]; 0 otherwise
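For instance, if the input was high during a = [100, 300] in one window and during b = [250, 400] in the next, the contribution to the sum is min(400,300) - max(250,100) = 50 samples. A minimal combinational sketch of this term for a single pair of intervals could look like the module below (the module name overlap_term and the ports a1, a2, b1, b2, overlap are placeholders of my own, not part of the code above):

module overlap_term (
    input  wire [10:0] a1, a2,  // interval where in was high in the previous M-sample window
    input  wire [10:0] b1, b2,  // interval where in is high in the current M-sample window
    output wire [10:0] overlap  // number of samples where both the current and delayed input are high
);
    wire [10:0] lo = (a1 > b1) ? a1 : b1;  // max(a1, b1)
    wire [10:0] hi = (a2 < b2) ? a2 : b2;  // min(a2, b2)
    assign overlap = ((b2 > a1) && (b1 < a2)) ? (hi - lo) : 11'd0;
endmodule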
Finally, the array b becomes the new array a, the next M samples are evaluated and stored into b, and the process is repeated until the end of in.
Comparing this "optimized" method to the initial one, there is a significant gain in hardware: initially 2000 bits were stored, whereas now only 220 bits are stored (for this example). However, the number is still large and not very practical.
I would greatly appreciate it if somebody could suggest a more optimal (hardware-wise) or simpler (algorithm-wise) way of doing this operation. Thank you in advance!
Edit:
Thanks to Alexey's idea, I optimized the algorithm as follows:
Given a set of delays M[i] for i = 1 to 10, with M[1] < M[2] < ... < M[10], and an input binary array in, we need to compute the outputs:
y[i] = sum(in[n]*in[n+M[i]]) for n = 1 to length(in).
We then define two empty arrays a[j] and b[j] with j = 1 to ~5. Whenever in has a 0->1 transition, the lowest-index empty element a[j] is "activated" and will increment at each clock cycle. The same goes for b[j] at 1->0 transitions. Basically, the pairs (a[j],b[j]) represent the portions of in equal to 1.
Whenever a[j] equals M[i], the sum y[i] will increment by 1 at each cycle while in = 1, until b[j] equals M[i]. Once a[j] equals M[10], a[j] is cleared; the same goes for b[j]. This is repeated until the end of in.
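For reference, here is a minimal Verilog sketch of how I picture this scheme for a single delay M (the module name edge_corr and signals such as in_d, in_m, hit_a, hit_b are my own; the "increment y while in = 1 until b[j] reaches M" behaviour is implemented with a held flag in_m, and the lowest-index free counter is picked with a simple priority loop). Extending it to the full set M[1]..M[10] would mean comparing every counter against every threshold and clearing only at M[10], as described above:

module edge_corr #(
    parameter M     = 2000,  // delay in clock cycles
    parameter NTAP  = 5,     // max rising (or falling) edges expected per M-sample window
    parameter WIDTH = 11     // counter width, 2^11 > 2000
)(
    input  wire        clock,
    input  wire        reset,
    input  wire        in,
    output reg  [31:0] y     // accumulates sum(in[n]*in[n-M])
);
    reg [WIDTH-1:0] a [1:NTAP];  // up-counters started on 0->1 edges of in
    reg [WIDTH-1:0] b [1:NTAP];  // up-counters started on 1->0 edges of in
    reg             in_d;        // previous sample of in, for edge detection
    reg             in_m;        // in delayed by M, held between "hits"

    // a[k] == M means a rising edge happened exactly M cycles ago, so the delayed
    // input turns 1 now; b[k] == M means it turns 0 now.
    integer k;
    reg hit_a, hit_b;
    always @* begin
        hit_a = 1'b0;
        hit_b = 1'b0;
        for (k = 1; k <= NTAP; k = k + 1) begin
            if (a[k] == M) hit_a = 1'b1;
            if (b[k] == M) hit_b = 1'b1;
        end
    end
    wire in_delayed = hit_a | (in_m & ~hit_b);

    integer j;
    reg started_a, started_b;
    always @(posedge clock) begin
        if (reset) begin
            for (j = 1; j <= NTAP; j = j + 1) begin
                a[j] <= 0;
                b[j] <= 0;
            end
            in_d <= 1'b0;
            in_m <= 1'b0;
            y    <= 32'd0;
        end else begin
            in_d <= in;

            // advance every running counter and retire it once it reaches M
            for (j = 1; j <= NTAP; j = j + 1) begin
                if (a[j] != 0) a[j] <= (a[j] == M) ? {WIDTH{1'b0}} : a[j] + 1'b1;
                if (b[j] != 0) b[j] <= (b[j] == M) ? {WIDTH{1'b0}} : b[j] + 1'b1;
            end

            // start the lowest-index free counter on each transition of in
            // (NTAP must cover the worst-case number of edges per M samples)
            started_a = 1'b0;
            started_b = 1'b0;
            for (j = 1; j <= NTAP; j = j + 1) begin
                if (in && !in_d && (a[j] == 0) && !started_a) begin
                    a[j] <= 1;
                    started_a = 1'b1;
                end
                if (!in && in_d && (b[j] == 0) && !started_b) begin
                    b[j] <= 1;
                    started_b = 1'b1;
                end
            end

            // accumulate the product of the current and the delayed input
            in_m <= in_delayed;
            if (in && in_delayed)
                y <= y + 32'd1;
        end
    end
endmodule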
Based on the same numerical assumptions as in the initial question, a total of 10 counters (a[j] and b[j]) of 11 bits each allows the computation of the 10 sums corresponding to the 10 different delays M[i]. This is almost 20 times better (in terms of resources used) than my initial approach. Any further optimization or idea is welcome!
Try this:

1. Make an array A of counters.
2. Every clock cycle, if in==1, get a free A element and write M to it.
3. Every clock cycle, decrement all non-zero A elements; when an element expires (reaches zero), check in: if in==1 - sum++.

Edit: the algorithm above is intended for an input like 00000000000010000000100010000000, while LLDinu really needs 11111111111110000000011111000000, so here is the modified algorithm:
1. Make an array A of counters.
2. Whenever in toggles, get a free A element and write M to it.
3. Every clock cycle, decrement all non-zero A elements; if in==1 and the number of non-zero A elements is even - sum++.
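In Verilog this could be sketched roughly as follows (the module name toggle_corr, the toggle counter live and the parameter values are assumptions, not exact code): every toggle of in loads M into a free down-counter, every non-zero counter decrements each cycle, and sum increments whenever in is 1 while an even number of counters are still running.

module toggle_corr #(
    parameter M     = 2000,  // delay in clock cycles
    parameter NTAP  = 10,    // max toggles expected within any M-cycle window
    parameter WIDTH = 11     // counter width, 2^11 > 2000
)(
    input  wire        clock,
    input  wire        reset,
    input  wire        in,
    output reg  [31:0] sum
);
    reg [WIDTH-1:0] A [1:NTAP];  // one down-counter per recent toggle of in
    reg             in_d;        // previous input sample, for toggle detection

    // count the toggles of in that happened during the last M cycles
    integer k;
    reg [3:0] live;
    always @* begin
        live = 0;
        for (k = 1; k <= NTAP; k = k + 1)
            if (A[k] != 0) live = live + 1'b1;
    end

    integer j;
    reg loaded;
    always @(posedge clock) begin
        if (reset) begin
            for (j = 1; j <= NTAP; j = j + 1) A[j] <= 0;
            in_d <= 1'b0;
            sum  <= 32'd0;
        end else begin
            in_d <= in;

            // decrement every running counter; it expires M cycles after its toggle
            for (j = 1; j <= NTAP; j = j + 1)
                if (A[j] != 0) A[j] <= A[j] - 1'b1;

            // on a toggle of in, load M into the lowest-index free counter
            loaded = 1'b0;
            for (j = 1; j <= NTAP; j = j + 1)
                if ((in != in_d) && (A[j] == 0) && !loaded) begin
                    A[j] <= M;
                    loaded = 1'b1;
                end

            // in[n-M] equals in[n] exactly when an even number of toggles occurred
            // in the last M cycles, so in==1 with even parity means in[n]*in[n-M]==1
            if (in && !live[0])
                sum <= sum + 32'd1;
        end
    end
endmodule

Only the parity of the number of running counters actually matters, so live could be reduced to a single bit toggled for each non-zero element if the counting adder is a concern.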