Tags: tcp, linux-kernel, zero-copy

Is it possible to map memory allocated with kmalloc to userspace, for use in sendmsg with the MSG_ZEROCOPY option over TCP?


I am working on a project where we have a kernel module that allocates memory using kmalloc() and maps it to userspace using remap_pfn_range(). In remap_pfn_range(), the vm_area_struct has its vm_flags field set to VM_IO | VM_PFNMAP (as well as some other flags that seem irrelevant to my problem). When I call sendmsg() with the address returned by our mmap() implementation (which is where the remap_pfn_range() call happens), the sendmsg() call fails with errno EFAULT.

I have tracked down the location in the kernel where -EFAULT is first returned: check_vma_flags() in mm/gup.c. The error is returned because the flags mentioned above are set in the vm_area_struct:

static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
{
    vm_flags_t vm_flags = vma->vm_flags;
    int write = (gup_flags & FOLL_WRITE);
    int foreign = (gup_flags & FOLL_REMOTE);

    if (vm_flags & (VM_IO | VM_PFNMAP))
        return -EFAULT; // <= This error propagates all the way back to userspace.
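
To make the failing condition concrete, here is a small userspace model of the check above. It is only a sketch: model_check_vma_flags() is a hypothetical stand-in, not the kernel function, and the flag values are copied from include/linux/mm.h on the assumption that they match the kernel in use. It shows that any mapping carrying VM_IO or VM_PFNMAP is rejected at this point, whatever other flags are set.

```c
#include <errno.h>

/* Flag values as found in include/linux/mm.h (assumed to match your kernel). */
#define VM_PFNMAP   0x00000400UL
#define VM_IO       0x00004000UL
#define VM_MIXEDMAP 0x10000000UL

/*
 * Hypothetical userspace model of the first test in check_vma_flags():
 * a VMA with VM_IO or VM_PFNMAP set is refused with -EFAULT.
 */
int model_check_vma_flags(unsigned long vm_flags)
{
    if (vm_flags & (VM_IO | VM_PFNMAP))
        return -EFAULT;
    return 0; /* this particular test does not look at any other flag */
}
```

In this model, model_check_vma_flags(VM_IO | VM_PFNMAP) returns -EFAULT, while a value without those two bits, such as VM_MIXEDMAP alone, returns 0. Passing this one test does not by itself guarantee that the rest of get_user_pages() will succeed.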

For background, the allocated memory is shared between the CPU and an FPGA in a Cyclone V SoC FPGA. The FPGA writes the bulk of the data to the buffers and requires the memory to be physically contiguous. We are currently using v5.3-rc8 of the Linux kernel.

The memory mapping works well for regular send() calls without any attempts at zero-copy, but now we need to improve the network transmission speed for our data. While investigating the zero-copy mode of sendmsg(), we observed up to 50% higher data transfer rates when sending regular userspace memory allocated with malloc().

I have tried unsetting the vm_area_struct.vm_flags bits that correspond to VM_IO and VM_PFNMAP after the call to remap_pfn_range(). With those bits cleared, sendmsg() appears to succeed: the syscall returns the size of the buffer that I wished to send. However, no data arrives on the client side. Also, when I reboot, a lot of errors are printed on the serial console.

I don't expect what I did to be the correct way to do this. I have to assume that the vm_flags are set for a very good reason, and there are no conditions in remap_pfn_range() that would prevent the flags from being set.

Hence, my question in the title: Is there a way to map memory obtained from kmalloc() to userspace such that it can be passed to sendmsg() with MSG_ZEROCOPY?

Update: I see that there have been a few posts regarding this subject already:

I will try the suggested answers from those posts and report back here.


Solution

  • I have managed to implement a solution similar to what is described in steps 3 and 4 of "Map physical memory to userspace as normal, struct page backed mapping".

    My fault handler looks like this:

    vm_fault_t vm_fault(struct vm_fault* vmf)
    {
        unsigned long pos = vmf->vma->vm_pgoff / NPAGES;
        unsigned long offset = vmf->address - vmf->vma->vm_start;
        // Translate the address to a page manually. For unknown reasons,
        // virt_to_page() causes a bus error when userspace writes to the
        // memory.
        unsigned long physaddr = __pa(memory_list[pos].kmalloc_area) + offset;
        unsigned long pfn = physaddr >> PAGE_SHIFT;
        struct page* page = pfn_to_page(pfn);
        // Increment refcount to page.
        get_page(page);
        return vmf_insert_page(vmf->vma, vmf->address, page);
    }
    

    and this is my mmap() implementation:

    static int mmap_mmap(struct file* filp, struct vm_area_struct* vma)
    {
        // Do not map to userspace using remap_pfn_range(), since those mappings
        // are incompatible with zero-copy for the sendmsg() syscall (due to the
        // VM_IO and VM_PFNMAP flags). Instead set up mapping using the fault handler.
        vma->vm_ops = &imageaccess_vm_operations;
        vma->vm_flags |= VM_MIXEDMAP; // Must be set, or else vmf_insert_page() will fail.
        return 0;
    }
    

    I have not figured out why virt_to_page() does not seem to work; if anyone has an idea, feel free to comment. In any case, this implementation works.
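
The address arithmetic in the fault handler can be checked outside the kernel. What follows is a userspace model only, not module code: buffer_index() and fault_pfn() are hypothetical helpers, the 4 KiB page size and the example addresses are assumptions, and phys_base stands in for __pa(memory_list[pos].kmalloc_area).

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, so the model does not depend on kernel headers. */
#define MODEL_PAGE_SHIFT 12

/*
 * Which buffer a mapping refers to: the mmap() offset, expressed in pages
 * (vm_pgoff), divided by the number of pages per buffer (NPAGES).
 */
unsigned long buffer_index(unsigned long vm_pgoff, unsigned long npages)
{
    return vm_pgoff / npages;
}

/*
 * Page frame number backing a faulting address: the byte offset of the fault
 * within the mapping is added to the buffer's physical base address, and the
 * sum is shifted down to a page frame number.
 */
uint64_t fault_pfn(uint64_t phys_base, uint64_t fault_addr, uint64_t vm_start)
{
    uint64_t offset = fault_addr - vm_start;
    return (phys_base + offset) >> MODEL_PAGE_SHIFT;
}
```

For example, with a buffer at physical address 0x30000000 mapped at 0xb6f00000, a fault at 0xb6f02000 is two pages into the buffer and resolves to page frame 0x30002. This only works because the kmalloc() buffer is physically contiguous, so consecutive virtual pages map to consecutive page frames.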