Contiguous and boundary-aligned memory allocation in the Linux kernel (PCIe + bus-master + scatter-gather)

I want to write a driver for my PCIe device, which has a bus-master DMA engine with scatter-gather capability. I have already done this for Windows and am now trying to do the same for Linux.

The bus-master engine of my PCIe device has a few limitations:

  1. It is only 32-bit; in other words, it cannot address more than 4 GB of RAM.
  2. The scatter-gather list cannot be placed just anywhere in RAM; it must be aligned to a 64 kB boundary.
  3. The scatter-gather list must be contiguous in RAM.

Now I'm completely stuck on the scatter-gather list and how to allocate memory for it in the Linux kernel. For example, I could use kmalloc(); as I understand it, it always allocates contiguous memory (that's fine) in the lower 896 MB of RAM (fine again, since 896 MB is below 4 GB), so that would work for me. But what about boundary alignment? How do I achieve this in Linux? Do I need to do anything to mark the allocated memory as non-cached and/or non-pageable?

This is part of my Windows driver. As you can see, contiguous and boundary-aligned memory allocation is pretty simple there.

// This is a scatter-gather entry, it is called PRD,
// Physical Region Descriptor
typedef struct _PRD
{
    ULONG BaseAddress; // Buffer physical address
    USHORT ByteCount; // Buffer length
    USHORT EoT; // End of Table marker
} PRD, *PPRD;

typedef struct _DEVICE_EXTENSION
{
    PPRD PRDT; // Pointer to Physical Region Descriptor Table
} DEVICE_EXTENSION, *PDEVICE_EXTENSION;

NTSTATUS StartDevice(_In_ PIRP Irp, _In_ PIO_STACK_LOCATION Stack, _In_ PDEVICE_EXTENSION DeviceExtension)
{
    PHYSICAL_ADDRESS LowestAcceptableAddress = { 0x0, 0x0 };
    PHYSICAL_ADDRESS HighestAcceptableAddress = { 0xFFFFFFFF, 0x0 }; // 4GB of RAM top limit
    PHYSICAL_ADDRESS BoundaryAddressMultiple = { 0x00010000, 0x0 }; // 64kB boundary
    ULONG Length = 4096 * sizeof(PRD); // 4096 PRD entries

    // PRDT memory allocation: 1) contiguous 2) boundary aligned and 3) non-cached
    DeviceExtension->PRDT = MmAllocateContiguousMemorySpecifyCache(Length,
        LowestAcceptableAddress, HighestAcceptableAddress,
        BoundaryAddressMultiple, MmNonCached);

    // Programming PRDT physical address to the bus-master control port
    PHYSICAL_ADDRESS a = MmGetPhysicalAddress(DeviceExtension->PRDT);
    for (UCHAR j = 0; j < 4; ++j)
        WRITE_PORT_UCHAR(BusMasterPrdtOffsetReg + j, (a.LowPart >> (j * 8)) & 0xFF);

    // ...
}


  • The buffer that holds the hardware-level scatter-gather descriptors is usually mapped as a "consistent" DMA mapping (a.k.a. "synchronous" or "coherent") so that its contents stay synchronized between the CPU and the device.

    Memory for consistent DMA mappings is allocated by dma_alloc_coherent(). The allocated memory has two addresses: the function returns a kernel virtual address and, via a pointer parameter, a DMA address in the device's address space. The DMA address will be 32-bit addressable by default, but that can be changed by calling dma_set_coherent_mask() with the appropriate mask during device initialization, e.g. err = dma_set_coherent_mask(device, DMA_BIT_MASK(32));.
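
    To make the effect of a DMA mask concrete, here is a small user-space model (not kernel code) of the addressing check that a mask implies. MODEL_DMA_BIT_MASK mirrors the kernel's DMA_BIT_MASK() definition; range_fits_mask() is a hypothetical helper invented for this illustration and is not part of the DMA API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space copy of the kernel's DMA_BIT_MASK(n): the low n bits set. */
#define MODEL_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Hypothetical helper: returns true if every byte of [addr, addr + size)
 * is addressable under the given mask. size must be non-zero.
 */
bool range_fits_mask(uint64_t addr, uint64_t size, uint64_t mask)
{
    assert(size != 0);
    if (addr > mask)
        return false;
    /* Compare against the remaining room so addr + size cannot overflow. */
    return size - 1 <= mask - addr;
}
```

    With a 32-bit mask, a 64 KiB buffer at 0xFFFF0000 still fits, while one byte more would cross the 4 GiB limit and be rejected.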

    Unfortunately, dma_alloc_coherent() takes no alignment argument. DMA-API-HOWTO only guarantees that the returned addresses are aligned to the smallest PAGE_SIZE order that is greater than or equal to the requested size, so if you need stricter alignment than that you will need to over-allocate and adjust the returned addresses. However, when freeing the memory with dma_free_coherent(), the original virtual address, DMA address, and size from dma_alloc_coherent() need to be specified. E.g.:

        /* Overallocate descriptor buffer by 64KiB for alignment. */
        sgbuffer_allocation_size = sgbuffer_size + 0x10000;
        /* GFP_KERNEL is assumed here (process context that may sleep). */
        sgbuffer_allocated = dma_alloc_coherent(device, sgbuffer_allocation_size,
                                  &sgbuffer_allocated_dma_addr, GFP_KERNEL);
        if (sgbuffer_allocated) {
            /* Align to 64KiB boundary. */
            sgbuffer_dma_addr = ALIGN(sgbuffer_allocated_dma_addr, 0x10000);
            /* Apply the same offset to the kernel virtual address. */
            sgbuffer = (typeof(sgbuffer))((char *)sgbuffer_allocated +
                              (sgbuffer_dma_addr - sgbuffer_allocated_dma_addr));
        }
        /* else: do error handling. */

        /* ... */

        /* Free descriptor buffer with original dma_alloc_coherent() values. */
        dma_free_coherent(device, sgbuffer_allocation_size,
                 sgbuffer_allocated, sgbuffer_allocated_dma_addr);
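
    The bookkeeping above (keep the original pair for freeing, use the aligned pair for the hardware) can be modelled in plain user-space C. This is only a sketch of the arithmetic, not kernel code: align_up() does the same rounding as the kernel's ALIGN() macro for power-of-two alignments, and struct aligned_buf and aligned_buf_init() are hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr up to the next multiple of alignment (must be a power of two). */
uint64_t align_up(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical record of one over-allocated buffer. */
struct aligned_buf {
    unsigned char *cpu_base; /* CPU address returned by the allocator */
    uint64_t dma_base;       /* DMA address returned by the allocator */
    unsigned char *cpu;      /* aligned CPU address used by the driver */
    uint64_t dma;            /* aligned DMA address given to the hardware */
};

/* Derive the aligned address pair from the allocator's original pair. */
void aligned_buf_init(struct aligned_buf *b, unsigned char *cpu_base,
                      uint64_t dma_base, uint64_t alignment)
{
    b->cpu_base = cpu_base;
    b->dma_base = dma_base;
    b->dma = align_up(dma_base, alignment);
    /* The CPU pointer moves by the same byte offset as the DMA address. */
    b->cpu = cpu_base + (size_t)(b->dma - dma_base);
}
```

    Keeping cpu_base and dma_base next to the aligned values makes it hard to pass the wrong pair to the free routine later.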

    If the host memory addresses described by the descriptors are mapped for DMA as "streaming" DMA mappings (so they are mapped temporarily for the transfer), and the mapped DMA addresses need to be 32-bit addressable for the hardware, then the normal (non-coherent) DMA mask for the device needs to be set during device initialization, using the dma_set_mask() function, e.g. err = dma_set_mask(device, DMA_BIT_MASK(32));. It is also possible to set both the normal DMA mask and the coherent DMA mask to the same value with the dma_set_mask_and_coherent() function, e.g.: err = dma_set_mask_and_coherent(device, DMA_BIT_MASK(32));.

    To learn more about the difference between consistent DMA mappings and streaming DMA mappings consult DMA-API-HOWTO from the Linux kernel documentation. The concept is similar to the DMA mapping in the Windows kernel. Streaming DMA mappings have a direction, and DMA transfers that use host addresses outside hardware DMA addressing range may make use of IOMMU mappings or use a "bounce buffer". As in the Windows kernel, this is all hidden by the DMA API as long as you follow the rules (although for performance reasons, you might want to try and avoid the use of bounce buffers).
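
    Once the data buffers have DMA addresses, the descriptor table has to be filled from them. The following user-space sketch shows one way that step could look for the PRD layout given in the question. Several things here are assumptions: the PRD_EOT value, the rule that a segment must fit in the 16-bit ByteCount field, and the names struct dma_seg and prd_fill(), which are hypothetical. A real driver would take the segments from the DMA mapping calls and would typically store the fields in the device's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PRD layout from the question: 32-bit address, 16-bit count, 16-bit EoT. */
struct prd {
    uint32_t base_address;
    uint16_t byte_count;
    uint16_t eot;
};

/* Assumption: this value in the EoT field marks the last entry. */
#define PRD_EOT 0x8000u

/* Hypothetical description of one DMA-mapped segment. */
struct dma_seg {
    uint64_t dma_addr;
    uint32_t len;
};

/*
 * Fill table[] from segs[]. Returns the number of entries written, or -1
 * if the table is too small or a segment violates the assumed hardware
 * limits (below 4 GiB, 1..65535 bytes). On failure the table may be
 * partially written.
 */
int prd_fill(struct prd *table, size_t table_len,
             const struct dma_seg *segs, size_t nsegs)
{
    size_t i;

    assert(table != NULL && segs != NULL);
    if (nsegs == 0 || nsegs > table_len)
        return -1;

    for (i = 0; i < nsegs; i++) {
        uint64_t addr = segs[i].dma_addr;
        uint32_t len = segs[i].len;

        if (len == 0 || len > UINT16_MAX)
            return -1;
        if (addr > UINT32_MAX || len - 1 > UINT32_MAX - addr)
            return -1;

        table[i].base_address = (uint32_t)addr;
        table[i].byte_count = (uint16_t)len;
        /* Only the final entry carries the end-of-table marker. */
        table[i].eot = (i == nsegs - 1) ? PRD_EOT : 0;
    }
    return (int)nsegs;
}
```

    Rejecting segments above 4 GiB here is only a safety net; with the DMA mask set as described above, the DMA API should not hand out such addresses in the first place.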