Can cache coherency protocols like snooping coherence be implemented in hardware(RTL)?

Can cache coherency protocols like snooping coherence protocol and MESI/MOESI be implemented in hardware(RTL)? I am designing RTL for a multicore cache environment and need to implement a cache coherency protocol in it so that all the processors see coherent and consistent data. This is just an academic exercise.

Any leads would be helpful. I have the state diagram for MSI; should I try to implement an FSM from that first? I am writing the code in synthesizable Verilog/SystemVerilog.

The FSM should be different for every cache block, so is there a mux connected to the FSM state machine controller like below?

Solution

Can cache coherency protocols like snooping coherence protocol and MESI/MOESI be implemented in hardware(RTL)?

Yes. They have been implemented in VLSI for years, as the people who have commented already stated.

should I try to implement a FSM [for Modified/Shared/Invalid (MSI) cache-coherence protocol] first?

I suppose so. It is certainly one of the easier ones. You didn't mention whether you've already implemented a basic cache controller. I feel like you should start with a single-processor cache controller and then expand it to peek at other processors' caches from there.
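To make the FSM idea concrete, here is a minimal, untested sketch of MSI next-state logic for a single block. The package, module, and signal names are my own inventions, not taken from any library. I am also assuming that snooped bus transactions take priority over local requests and that a bus request completes in the same cycle, which a real bus would not guarantee.

```systemverilog
// Untested sketch. All names here are hypothetical.
package msi_pkg;
    typedef enum logic [1:0] {
        MSI_INVALID  = 2'b00,
        MSI_SHARED   = 2'b01,
        MSI_MODIFIED = 2'b10
    } msi_state_t;
endpackage

module msi_next_state
    import msi_pkg::*;
(
    input  msi_state_t cur_state,    // state bits of the selected block
    input  logic       cpu_read,     // local processor load to this block
    input  logic       cpu_write,    // local processor store to this block
    input  logic       snoop_read,   // another cache's read seen on the bus
    input  logic       snoop_write,  // another cache's read-for-ownership seen on the bus
    output msi_state_t next_state,
    output logic       bus_rd,       // request the block for reading
    output logic       bus_rdx,      // request the block for writing
    output logic       writeback     // dirty data must be written back
);
    always_comb begin
        // Defaults: hold state, no bus activity.
        next_state = cur_state;
        bus_rd     = 1'b0;
        bus_rdx    = 1'b0;
        writeback  = 1'b0;

        case (cur_state)
            MSI_INVALID: begin
                // Snoops are ignored; we hold no copy of the block.
                if (cpu_write) begin
                    bus_rdx    = 1'b1;
                    next_state = MSI_MODIFIED;
                end else if (cpu_read) begin
                    bus_rd     = 1'b1;
                    next_state = MSI_SHARED;
                end
            end
            MSI_SHARED: begin
                if (snoop_write) begin
                    next_state = MSI_INVALID;   // another cache wants ownership
                end else if (cpu_write) begin
                    bus_rdx    = 1'b1;          // invalidate the other sharers
                    next_state = MSI_MODIFIED;
                end
                // cpu_read hits and snoop_read leave the block Shared.
            end
            MSI_MODIFIED: begin
                if (snoop_write) begin
                    writeback  = 1'b1;
                    next_state = MSI_INVALID;
                end else if (snoop_read) begin
                    writeback  = 1'b1;
                    next_state = MSI_SHARED;
                end
                // Local reads and writes hit without bus traffic.
            end
            default: next_state = MSI_INVALID;  // unused encoding 2'b11
        endcase
    end
endmodule
```

Because this logic is purely combinational, one copy can serve every block: read the selected block's state bits, compute the next state, and write the result back to that block.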

Personally, I found Computer Organization and Design: The Hardware/Software Interface, Fifth Edition (Patterson and Hennessy) very helpful when getting into cache architectures. I attached a screenshot of a section of the book that talks about building a cache controller.

From that point, you might look at resources like these, which dive into an architectural design of cache controllers with cache-coherency functionalities. I pulled these up with basic Google searches.

is there a mux connected to the FSM state machine controller like below?

To access a block/line in the cache, yes. However, remember that your coherency controller may not need to have access to the data in the cache. Your coherency logic most likely only needs to read and/or modify the state bits of the block. This could save you some bit width. So I might draw my diagram like this:

This is just an idea, not a concrete "this is how you should do it" answer.

Since you're using MSI, you only need 2 bits to represent its three states. Reading and outputting only those 2 bits should be more efficient than fetching the entire block, data included, for the coherency controller.

// Note: this code has not been tested. It is for illustrative purposes only.
module <cache name> (
    input        [X:0] address,
    output logic [1:0] block_state
);

    // Cache memory blocks (32-bits of data. 2 bits of "state" info)
    logic [33:0] cache_mem [0:N];

    // Give index bits of address a name
    wire [Y:0] address_index;
    assign address_index[Y:0] = address[A:B];

    // Output the state bits of the selected cache block.
    // Select the block (unpacked index) first, then slice its top two bits.
    assign block_state[1:0] = cache_mem[address_index][33:32];

endmodule
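Going one step further, the state bits could live in their own small array, so the processor side and the snoop side can each look up a block's state without touching the data. The following is an untested sketch under my own assumptions: the module name, ports, and the `INDEX_BITS` parameter are hypothetical, `2'b00` is assumed to encode Invalid, and the tag comparison a real snoop needs is left out.

```systemverilog
// Untested sketch. All names and parameters here are hypothetical.
module state_array #(
    parameter int INDEX_BITS = 6               // assumed: 64 blocks, direct-mapped
) (
    input  logic                  clk,
    input  logic                  rst_n,
    // Two independent read ports: one per requester.
    input  logic [INDEX_BITS-1:0] cpu_index,
    input  logic [INDEX_BITS-1:0] snoop_index,
    output logic [1:0]            cpu_state,
    output logic [1:0]            snoop_state,
    // Single write port, driven by the coherence FSM.
    input  logic                  wr_en,
    input  logic [INDEX_BITS-1:0] wr_index,
    input  logic [1:0]            wr_state
);
    localparam int NUM_BLOCKS = 1 << INDEX_BITS;

    // Only 2 state bits per block; data and tags are stored elsewhere.
    logic [1:0] state_mem [NUM_BLOCKS];

    // Combinational reads: these index selects are the muxes in the diagram.
    assign cpu_state   = state_mem[cpu_index];
    assign snoop_state = state_mem[snoop_index];

    always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            // Every block starts out Invalid (assumed encoding 2'b00).
            for (int i = 0; i < NUM_BLOCKS; i++)
                state_mem[i] <= 2'b00;
        end else if (wr_en) begin
            state_mem[wr_index] <= wr_state;
        end
    end
endmodule
```

A resettable array with combinational reads like this will most likely synthesize to flip-flops rather than a RAM macro. That should be fine for a small academic design, but it would not scale to a large cache.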

Hopefully some of that is helpful for you! Cheers!