Tags: memory, arm, cpu-architecture, risc

DMA vs Load/Store Unit


As I understand it, the LSU (load/store unit) in a RISC architecture like Arm handles load/store instructions, and a DMA (direct memory access) unit is responsible for moving data independently of the processor: memory to memory, peripheral to memory, and so on. What I am confused about is which one handles the prefetching of instructions or data for the branch predictor or the instruction/data cache. Since prefetching is not an instruction but an automatic process to speed up the processor, is this job handled by DMA? I am confused because the DMA unit is shown as an external unit in the example design given in the Arm Cortex-M85 technical reference manual.


Solution

  • Based on the follow-up question in the comments to Jake's answer:

    DMA is generally specific to the chip, not the core, so it is not an ARM thing (as already answered). A number of MCUs have DMA built in, so that, for example, you can set up some sort of data transfer and the peripheral can go get the data for you rather than you having to service interrupts within a certain amount of time or poll. Due to limited resources and/or continuous data transfer, the peripheral may have a buffer with a watermark, or ping-pong buffers; this gives you time to prepare the next buffer while the peripheral uses DMA to transfer from the current one (see the sketch below).
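
    A minimal C sketch of that ping-pong scheme, assuming a hypothetical memory-mapped DMA controller (the DMA_SRC/DMA_LEN/DMA_CTRL register names and addresses are made up; every vendor defines its own, so check your chip's reference manual):

    ```c
    #include <stdint.h>

    #define BUF_WORDS 256

    /* Hypothetical memory-mapped DMA controller registers (made-up addresses). */
    #define DMA_SRC  (*(volatile uint32_t *)0x40001000u)
    #define DMA_LEN  (*(volatile uint32_t *)0x40001004u)
    #define DMA_CTRL (*(volatile uint32_t *)0x40001008u)
    #define DMA_GO   (1u << 0)

    static uint32_t buf[2][BUF_WORDS];     /* the ping-pong pair                */
    static volatile int draining = 0;      /* buffer the DMA is draining        */
    static volatile int filling  = 1;      /* buffer the CPU is filling         */

    /* DMA-complete interrupt: swap roles and restart the engine.
     * Sketch only: real code needs a "CPU finished filling" flag so an
     * underrun (DMA catching up with the CPU) is detected, not ignored. */
    void dma_done_isr(void)
    {
        int t    = draining;
        draining = filling;                /* hand the freshly filled buffer   */
        filling  = t;                      /* ...to the DMA, reuse the old one */

        DMA_SRC  = (uint32_t)(uintptr_t)buf[draining];
        DMA_LEN  = BUF_WORDS;
        DMA_CTRL = DMA_GO;                 /* peripheral drains in background  */
        /* Main-loop code now refills buf[filling] in parallel. */
    }
    ```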

    Do not assume that DMA is free or fast; many folks make that mistake. It is very much based on the system design. Sometimes the DMA transfers happen during unused bus slots and for the most part feel free. Some designs intentionally leave slots open just in case you are doing DMA; I think that is wasteful, but I have seen it. And there are also designs (ARM-based even) where the DMA takes over the bus for a period of time and the CPU is essentially stalled: as soon as it needs to touch that bus (fetching or load/store), it stalls until the DMA completes.

    Ask yourself whether your design has data transfers into/out of a peripheral for which the peripheral itself has no storage, so that you want to use the SRAM used by the processor. Call it DMA or just an arbiter, but you will then want to design your SRAM interface so that either the ARM or the peripheral can access the SRAM, ideally without too much performance pain for either one, and/or let the programmer choose a rate: DMA gets only one transfer every X clocks... (a behavioral sketch of such an arbiter follows).
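
    A behavioral C sketch of that kind of throttled arbiter. In a real chip this would be RTL, not C; the names and the CPU-always-wins priority policy are illustrative assumptions:

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { GRANT_NONE, GRANT_CPU, GRANT_DMA } grant_t;

    typedef struct {
        uint32_t dma_interval;   /* programmer-chosen rate: one DMA grant per N cycles */
        uint32_t cycles_since;   /* cycles elapsed since the last DMA grant            */
    } arbiter_t;

    /* One arbitration decision per SRAM cycle. */
    grant_t arbiter_tick(arbiter_t *a, bool cpu_req, bool dma_req)
    {
        a->cycles_since++;
        if (cpu_req)
            return GRANT_CPU;                        /* the processor always wins */
        if (dma_req && a->cycles_since >= a->dma_interval) {
            a->cycles_since = 0;
            return GRANT_DMA;                        /* a throttled DMA slot      */
        }
        return GRANT_NONE;                           /* SRAM idle this cycle      */
    }
    ```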

    Or do you have storage on the peripheral for a whole transfer, but moving that transfer to/from SRAM for the processor to operate on would burn a fair amount of load/store operations on the processor? That may also want a DMA transfer capability, so that the processor can fire and forget, then poll or wait for an interrupt to know the transfer has completed (see the sketch below).
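
    A fire-and-forget sketch under the same assumptions (hypothetical register names and addresses): kick off a block copy from the peripheral's local buffer into SRAM, then either poll or sleep until the controller signals completion:

    ```c
    #include <stdint.h>

    /* Hypothetical DMA controller registers (made-up addresses and bits). */
    #define DMA_SRC    (*(volatile uint32_t *)0x40002000u)
    #define DMA_DST    (*(volatile uint32_t *)0x40002004u)
    #define DMA_LEN    (*(volatile uint32_t *)0x40002008u)
    #define DMA_CTRL   (*(volatile uint32_t *)0x4000200Cu)
    #define DMA_STATUS (*(volatile uint32_t *)0x40002010u)
    #define DMA_GO     (1u << 0)
    #define DMA_DONE   (1u << 0)

    void copy_from_peripheral(uint32_t periph_addr, uint32_t *sram, uint32_t words)
    {
        DMA_SRC  = periph_addr;
        DMA_DST  = (uint32_t)(uintptr_t)sram;
        DMA_LEN  = words;
        DMA_CTRL = DMA_GO;                  /* fire: the processor is free now  */

        while (!(DMA_STATUS & DMA_DONE))    /* ...forget: poll for completion,  */
            ;                               /* or enable the IRQ and sleep/WFI  */
    }
    ```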

    The ARM docs just get you the ARM bus; your system is not necessarily on an ARM bus. Your SRAM does not have an ARM bus (nor does your DDR controller on a larger system), nor do the peripherals, generally. The interface is often driven by the peripheral or the SRAM, so you are already gluing it all together, as you know. That is usually where the DMA lives. You would buffer up ARM transfers in your logic (you would anyway), as well as peripheral-driven transfers if the peripheral can be a bus master, and then arbitrate the shared resource.

    Recommendations for resources are certainly not what this site is for, and asking for them is a quick way to get a question closed.

    I'm confused as to why you are asking this, because if you have the resources to actually build a chip, this is all basic chip-design stuff. And building something with an ARM in it (other than educational FPGA work, I guess) really adds to the cost.

    At the end of the day, do you have peripherals/transfers that you don't want to overly burden the processor with, or that the processor cannot handle due to bus timing, interrupt latency, etc.? "Overly burdened" starts with senior members of the software team warning you that if you try to go into production with this design, they will not write software to support it and it will fail. Historically there has been a wall between the teams, but these days, with pretty much all chip startups failing, the silicon, hardware, and software teams all need to work together from the inception of the chip and through simulation and emulation.

    Knowing your partners allows for give and take: if you give me DMA on this one, then your FIFO can be smaller or slower; I want to be able to poll my way through it for various reasons, but also have an interrupt with at least a 50% watermark (or ping-pong buffers). In return I can offer you some logic that makes a software task much easier if you are interested: a CRC engine, hashing, etc. Trivial for me, time consuming for you (see the sketch below). And so on.
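
    To make the "trivial for me, time consuming for you" trade concrete, here is a sketch of a memory-mapped CRC engine the hardware side could drop in, next to the bit-at-a-time software CRC-32 it would replace. The engine's registers and addresses are hypothetical:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical memory-mapped CRC engine (made-up addresses). */
    #define CRC_INIT (*(volatile uint32_t *)0x40003000u)
    #define CRC_DATA (*(volatile uint32_t *)0x40003004u)
    #define CRC_OUT  (*(volatile uint32_t *)0x40003008u)

    /* Hardware path: one store per word; the engine does the bit math. */
    uint32_t crc_hw(const uint32_t *p, size_t words)
    {
        CRC_INIT = 0xFFFFFFFFu;             /* seed the engine */
        while (words--)
            CRC_DATA = *p++;
        return CRC_OUT;
    }

    /* The bit-at-a-time software CRC-32 (reflected) this replaces. */
    uint32_t crc_sw(const uint8_t *p, size_t n)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (n--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1u));
        }
        return ~crc;
    }
    ```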

    The real bottom line is to work with your software and hardware folks (hardware meaning PCB: putting the part on a board with other components, packaging, electrical specs, etc.). Between your thoughts and experience on peripheral implementation and the software/hardware teams' experience, it should very quickly close on the data-transfer solutions for all the peripherals inside and outside the chip. And not all of them should be assumed to want DMA, nor to share the same engine if you make the DMA its own engine.