Tags: linux-kernel, driver, linux-device-driver

Why do we need a DMA pool?


I'm reading https://www.kernel.org/doc/Documentation/DMA-API.txt and I don't understand why a DMA pool is needed.

Why not allocate a PAGE_SIZE chunk of DMA memory with dma_alloc_coherent and use offsets into it?

Also, why is dynamic DMA mapping useful for a networking device driver instead of reusing the same DMA memory?

What is the most performant option for <1KB data transfers?


Solution

  • Warning: I'm not an expert on the Linux kernel.

    The LDD3 book (which may be a better place to start reading) says that DMA pools work better for small DMA regions (smaller than a page) - https://static.lwn.net/images/pdf/LDD3/ch15.pdf page 447 or https://www.oreilly.com/library/view/linux-device-drivers/0596005903/ch15.html, "DMA pools" section:

    A DMA pool is an allocation mechanism for small, coherent DMA mappings. Mappings obtained from dma_alloc_coherent may have a minimum size of one page. If your device needs smaller DMA areas than that, you should probably be using a DMA pool. DMA pools are also useful in situations where you may be tempted to perform DMA to small areas embedded within a larger structure. Some very obscure driver bugs have been traced down to cache coherency problems with structure fields adjacent to small DMA areas. To avoid this problem, you should always allocate areas for DMA operations explicitly, away from other, non-DMA data structures. ... Allocations are handled with dma_pool_alloc

    The same is stated in https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt:

    If your driver needs lots of smaller memory regions, you can write custom code to subdivide pages returned by dma_alloc_coherent(), or you can use the dma_pool API to do that. A dma_pool is like a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages(). Also, it understands common hardware constraints for alignment, like queue heads needing to be aligned on N byte boundaries.

    So DMA pools are an optimization for smaller allocations. You can use dma_alloc_coherent for every small DMA buffer individually (with larger overhead), or you can build your own pool on top of it (more custom code to manage offsets and allocations). But DMA pools are already implemented and ready to use.
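    For illustration only, here is a minimal sketch of the dma_pool API for a hypothetical PCI driver (the pool name, sizes, and the setup/use/teardown functions are my assumptions, not taken from any real driver):

    #include <linux/dmapool.h>
    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    /* Hypothetical pool of small, 64-byte, 8-byte-aligned descriptors
     * carved out of coherent DMA memory. */
    static struct dma_pool *desc_pool;

    static int example_setup(struct pci_dev *pdev)
    {
        desc_pool = dma_pool_create("example-desc", &pdev->dev, 64, 8, 0);
        return desc_pool ? 0 : -ENOMEM;
    }

    static int example_use(void)
    {
        dma_addr_t dma_handle;
        void *desc;

        /* Returns a small coherent buffer plus its bus address; the pool
         * subdivides pages obtained from dma_alloc_coherent() internally. */
        desc = dma_pool_alloc(desc_pool, GFP_KERNEL, &dma_handle);
        if (!desc)
            return -ENOMEM;

        /* ... program dma_handle into the device, touch desc from the CPU ... */

        dma_pool_free(desc_pool, desc, dma_handle);
        return 0;
    }

    static void example_teardown(void)
    {
        dma_pool_destroy(desc_pool);
    }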

    The performance of the different methods should be profiled for your particular case.

    Example of dynamic DMA mapping in a network driver (used for skb data and fragments): https://elixir.bootlin.com/linux/v4.6/source/drivers/net/ethernet/realtek/r8169.c

    /* RX: map every receive buffer so the device can write into it */
    static struct sk_buff *rtl8169_alloc_rx_data
        mapping = dma_map_single(d, rtl8169_align(data), rx_buf_sz,
                     DMA_FROM_DEVICE);

    /* TX: map the skb data and each fragment so the device can read them */
    static int rtl8169_xmit_frags
            mapping = dma_map_single(d, addr, len, DMA_TO_DEVICE);
    static netdev_tx_t rtl8169_start_xmit
        mapping = dma_map_single(d, skb->data, len, DMA_TO_DEVICE);

    /* unmap once the hardware has finished with the buffer */
    static void rtl8169_unmap_tx_skb
        dma_unmap_single(d, le64_to_cpu(desc->addr), len, DMA_TO_DEVICE);
    

    Mapping skb fragments for DMA in place can be better (if scatter-gather DMA is supported by the NIC chip) than copying every fragment of the skb into some preallocated DMA memory. See the "Understanding Linux Network Internals" book, the "dev_queue_xmit Function" section and Chapter 21, and skb_linearize; a sketch of the mapping pattern is shown below.
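    For illustration only, a minimal sketch of the streaming-mapping pattern those excerpts follow (the function names and the completion handling are hypothetical simplifications, not from the real driver):

    #include <linux/dma-mapping.h>
    #include <linux/skbuff.h>

    /* Hypothetical TX path: map the skb data for the device to read,
     * hand the bus address to the hardware, unmap on TX completion. */
    static int example_map_tx(struct device *d, struct sk_buff *skb,
                              dma_addr_t *mapping)
    {
        *mapping = dma_map_single(d, skb->data, skb->len, DMA_TO_DEVICE);
        if (dma_mapping_error(d, *mapping))
            return -ENOMEM;

        /* ... write *mapping and skb->len into a TX descriptor ... */
        return 0;
    }

    static void example_tx_complete(struct device *d, dma_addr_t mapping,
                                    unsigned int len)
    {
        /* After unmapping, the buffer belongs to the CPU again. */
        dma_unmap_single(d, mapping, len, DMA_TO_DEVICE);
    }

    The packet data already lives in the skb, so mapping it in place per transmission is usually cheaper than copying it into a fixed, reused DMA buffer - which is the point of dynamic (streaming) mappings in network drivers.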

    Example of DMA pool usage - the nvme driver (a PRP, Physical Region Page, is a 64-bit pointer that is part of the Submission Queue Entry, and a "PRP List contains a list of PRPs with generally no offsets."):

    https://elixir.bootlin.com/linux/v4.6/source/drivers/nvme/host/pci.c#L1807

    /* create a pool of page-sized, page-aligned buffers for PRP lists */
    static int nvme_setup_prp_pools(struct nvme_dev *dev)
    {
        dev->prp_page_pool = dma_pool_create("prp list page", dev->dev,
                            PAGE_SIZE, PAGE_SIZE, 0);

    /* allocate one PRP list per request that needs it */
    static bool nvme_setup_prps
        prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);

    /* return the PRP list to the pool when the request completes */
    static void nvme_free_iod
            dma_pool_free(dev->prp_small_pool, list[0], prp_dma);
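    In the same file, if I read it correctly, the driver also creates a second, smaller pool (dev->prp_small_pool, 256-byte allocations) for short PRP lists, and destroys both pools with dma_pool_destroy() on teardown. The PAGE_SIZE size and alignment of prp_page_pool keep each PRP list inside a single page, matching the "no offsets" constraint quoted above - a good illustration of the many-small-aligned-allocations case that DMA pools are designed for.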