Tags: algorithm, verilog, hdl

Verilog: Minimal (hardware) algorithm for multiplying a binary input by its delayed form


I have a binary input in (a 1-bit serial input) which I want to delay by M clock pulses, and then multiply (AND) the two signals. In other words, I want to evaluate the sum:

sum(in[n]*in[n+M])

where n is expressed in terms of the number of clock pulses.
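
For reference, a non-synthesizable behavioral description of this sum, e.g. in a testbench, could look like the following (it assumes the whole input is available as a bit vector samples of length N; both names are only for illustration):

// Behavioral reference only, not the hardware implementation:
// correlate the input with a copy of itself delayed by M samples.
integer n;
reg [31:0] ref_sum;
initial begin
    ref_sum = 0;
    for (n = 0; n + M < N; n = n + 1)
        ref_sum = ref_sum + (samples[n] & samples[n + M]);
end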

The most straightforward way is to store the latest M samples of in in a memory buffer in_dly. In Verilog, this would look something like:

always @(posedge clock ...) begin
    ...
    in_dly[M-1:0] <= {in_dly[M-2:0], in};
    if (in_dly[M-1] & in)
        sum <= sum + 'd1;
    ...
end

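Spelled out as a self-contained module, a minimal sketch of this shift-register approach could look as follows (the module name, the synchronous reset, and the 32-bit default sum width are assumptions, not part of the question):

module corr_delay #(
    parameter M     = 2000,          // delay in clock cycles
    parameter SUM_W = 32             // width of the running sum
) (
    input  wire             clock,
    input  wire             reset,
    input  wire             in,
    output reg  [SUM_W-1:0] sum
);
    reg [M-1:0] in_dly;              // M-deep shift register of past samples (costly for large M)

    always @(posedge clock) begin
        if (reset) begin
            in_dly <= {M{1'b0}};
            sum    <= {SUM_W{1'b0}};
        end else begin
            in_dly <= {in_dly[M-2:0], in};   // shift in the newest sample
            if (in_dly[M-1] & in)            // AND of current and M-delayed bit
                sum <= sum + 1'b1;
        end
    end
endmodule
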
While this works in theory, for large values of M (up to ~2000) the size of the buffer is not practical. However, I was thinking of taking advantage of the fact that the input signal is 1 bit wide and is expected to toggle only a few times (~1-10) during M samples.

This made me think of storing the toggle times from 2k*M to (2k+1)*M in an array a and from (2k+1)*M to (2k+2)*M in an array b (k is just an integer used to generalize the idea):

reg [10:0]      a[0:9];     //2^11 > max(M)=2000  and "a" has max 10 elements
reg [10:0]      b[0:9];     //same as "a"

Therefore, during the first M samples, in is high ('b1) during the intervals [a[1],a[2]], [a[3],a[4]], etc. Similarly, during the next M samples, the input is high during [b[1],b[2]], [b[3],b[4]], etc. The sum is then the overlap of these intervals:

min(b[2],a[2])-max(b[1],a[1]), if b[2]>a[1] and b[1]<a[2]; 0 otherwise

Finally, the array b becomes the new array a and the next M samples are evaluated and stored into b. The process is repeated until the end of in.
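
As an illustration, the overlap term for one pair of intervals could be computed with a small combinational function (a hypothetical helper to be placed inside a module; the pairing of a/b elements into intervals and intervals that cross window boundaries are not handled here):

// Overlap length of a high interval [a_lo, a_hi] from one M-sample window
// with [b_lo, b_hi] from the next window; 0 if they do not overlap.
function [10:0] overlap_len;
    input [10:0] a_lo, a_hi, b_lo, b_hi;
    reg   [10:0] lo_max, hi_min;
    begin
        lo_max = (a_lo > b_lo) ? a_lo : b_lo;   // later of the two start times
        hi_min = (a_hi < b_hi) ? a_hi : b_hi;   // earlier of the two end times
        overlap_len = ((b_hi > a_lo) && (b_lo < a_hi)) ? (hi_min - lo_max) : 11'd0;
    end
endfunction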

Comparing this "optimized" method to the initial one, there is a significant gain in hardware: initially 2000 bits were stored, whereas now only 220 bits are stored (for this example). However, the number is still large and not very practical.

I would greatly appreciate it if somebody could suggest a more efficient way (hardware-wise) or a simpler way (algorithm-wise) of doing this operation. Thank you in advance!

Edit:

Thanks to Alexey's idea, I optimized the algorithm as follows:

Given a set of delays M[i] for i=1 to 10 with M[1]<M[2]<...<M[10], and a binary input array in, we need to compute the outputs:

y[i] = sum(in[n]*in[n+M[i]]) for n=1 to length(in).

We then define 2 empty arrays a[j] and b[j] with j = 1 to ~5. Whenever in has a 0->1 transition, the empty element a[j] with the smallest index is "activated" and increments at each clock cycle. The same goes for b[j] at 1->0 transitions. Basically, the pairs (a[j],b[j]) represent the portions of in equal to 1.

Whenever a[j] equals M[i], the sum y[i] will increment by 1 at each cycle while in = 1, until b[j] equals M[i]. Once a[j] equals M[10], a[j] is cleared. Same goes for b[j]. This is repeated until the end of in.

Based on the same numerical assumptions as in the initial question, a total of 10 registers of 11 bits (the elements of a and b) allows the computation of the 10 sums corresponding to the 10 different delays M[i]. This is almost 20 times better (in terms of resources used) than my initial approach. Any further optimization or idea is welcome!
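
To illustrate just the compare/accumulate step for a single delay M_i and a single counter pair (signal names are hypothetical; clock, in and the parameter M_i are assumed to exist in the enclosing module, and the allocation, incrementing and clearing of the counters are omitted):

// a_cnt counts cycles since a 0->1 edge of in, b_cnt since the matching
// 1->0 edge; a_run/b_run flag whether each counter has been started.
reg  [10:0] a_cnt, b_cnt;
reg         a_run, b_run;
reg  [31:0] y_i;                 // running sum for delay M_i

// the M_i-delayed input is high once the rising edge is at least M_i
// cycles old and the matching falling edge is not yet that old
wire dly_high_i = a_run && (a_cnt >= M_i) && !(b_run && (b_cnt >= M_i));

always @(posedge clock)
    if (in && dly_high_i)
        y_i <= y_i + 1'b1;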


Solution

  • Try this:

    • Make an array A.
    • Every time in==1, take a free element of A and write M to it.
    • Every clock, decrement all non-zero elements of A.
    • Once any decremented element becomes zero, test in; if in==1, sum++.

    Edit: the algorithm above is intended for inputs like 00000000000010000000100010000000, while LLDinu really needs 11111111111110000000011111000000, so here is a modified algorithm (a Verilog sketch follows the list):

    • Make an array (ring buffer) A.
    • Every time in toggles, take a free element of A and write M to it.
    • Every clock, decrement all non-zero elements of A.
    • Every clock, test in; if in==1 and the number of non-zero elements of A is even, sum++.
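
    A minimal Verilog sketch of this modified algorithm, under assumed names and sizes (NTOG bounds the number of toggles per M-cycle window; the previous sample in_q is used in the accumulate step so that the toggles held in the timers line up exactly with the M-delayed input):

    module corr_toggle #(
        parameter M    = 2000,   // delay in clock cycles
        parameter NTOG = 8,      // worst-case toggles within any M-cycle window
        parameter CW   = 11      // timer width, 2^11 > 2000
    ) (
        input  wire        clock,
        input  wire        reset,
        input  wire        in,
        output reg  [31:0] sum
    );
        reg [CW-1:0] timer [0:NTOG-1];  // one countdown timer per recent toggle, 0 = free
        reg          in_q;              // previous sample, for toggle detection
        integer      j, free_idx;

        wire toggle = (in != in_q);

        // number of running timers = number of toggles within the last M samples
        reg [3:0] active;
        always @* begin
            active = 0;
            for (j = 0; j < NTOG; j = j + 1)
                if (timer[j] != 0)
                    active = active + 1'b1;
        end

        // lowest-indexed free timer (overwrites timer[0] if NTOG is undersized)
        always @* begin
            free_idx = 0;
            for (j = NTOG - 1; j >= 0; j = j - 1)
                if (timer[j] == 0)
                    free_idx = j;
        end

        always @(posedge clock) begin
            if (reset) begin
                in_q <= 1'b0;
                sum  <= 0;
                for (j = 0; j < NTOG; j = j + 1)
                    timer[j] <= 0;
            end else begin
                in_q <= in;

                // decrement every running timer
                for (j = 0; j < NTOG; j = j + 1)
                    if (timer[j] != 0)
                        timer[j] <= timer[j] - 1'b1;

                // on a toggle, load a free timer with M
                if (toggle)
                    timer[free_idx] <= M;

                // the timers record the toggles between the previous sample and its
                // M-delayed copy, so an even count means the two are equal;
                // accumulate on in_q to keep that alignment exact
                if (in_q && (active[0] == 1'b0))
                    sum <= sum + 1'b1;
            end
        end
    endmodule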