
DPDK bad throughput, is the configurations' fault? How to improve?


BACKGROUND:

I am trying to write a DPDK app which is supposed to handle packets coming from inside a Virtual Machine Monitor.

Basically the VMM is getting the packets from its guest and sending them to DPDK, which then sends them out on the NIC.

Virtual Machine -> Virtual Machine Manager -> DPDK -> NIC

The architecture above is supposed to replace and outperform the original architecture, in which the VMM puts the packets on a TAP interface.

Original:

Virtual Machine -> Virtual Machine Manager -> TAP interface -> NIC

Problem:

I have implemented the new architecture and the throughput is far worse than when using the TAP interface (TAP: 300 MB/s in either direction; DPDK: 50 MB/s with the VM as sender, 5 MB/s with the VM as receiver).

I suspect that I am not configuring my DPDK application properly. Could you give an opinion on my configuration?

Environment:

I have done all the testing inside a QEMU virtual machine, so both architectures described above were run inside this virtual machine:

3 logical CPUs (out of 8 on host)

4096 MB memory

OS: Ubuntu 20.04

2 NICs, one for SSH and one for DPDK

What I did so far:

2 GB of hugepages

Isolated the CPU which DPDK is using.
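
For reference, this kind of in-guest setup can be sketched as below (the page count, mount point, and core index are assumptions; adjust to your VM):

```shell
# Reserve 1024 x 2 MB hugepages (2 GB) at runtime and mount hugetlbfs
echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
sudo mkdir -p /mnt/huge
sudo mount -t hugetlbfs nodev /mnt/huge

# Isolate core 2 for DPDK via the kernel command line, then update grub and reboot
# In /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=2 nohz_full=2 rcu_nocbs=2"
sudo update-grub
```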

Here is the code: https://github.com/mihaidogaru2537/DpdkPlayground/blob/Strategy_1/primary_dpdk_firecracker/server.c

All functional logic is in "lcore_main", everything else is just configuration.

All the advice I could find about increasing performance involves hardware rather than configuration parameters. I don't know whether the values I am using for things such as:

#define RX_RING_SIZE 2048
#define TX_RING_SIZE 2048

#define NUM_MBUFS 8191
#define MBUF_CACHE_SIZE 250
#define BURST_SIZE 32

are OK or not, although I got them from one of the example applications in the official documentation.

Thanks for reading and let me know if you have any questions!

UPDATE 1:

1. Is the VMM a userspace hypervisor which acts as an interface to the guest OS?

The VMM is running in userspace, yes. The VM is running inside the memory of the VMM. This is the VMM's architecture: link

2. Where is the link to VMM?

The VMM link: this is the net device emulated by the VMM, and it is where I wrote all my modifications in order to "bind" to the DPDK primary.

Here are a couple of functions which interact directly with the memory of the guest.

For example, when a packet needs to be sent from the guest to the internet:

process_tx() is called; it reads the packet from the guest and sends it to DPDK. The code doing this is at this line, and a little below is where I do the rte_ring_enqueue.

3. As per the setup, DPDK runs on the host. Is this correct?

So everything is running inside QEMU. The DPDK primary is running in QEMU, and Firecracker is running in QEMU. There is a virtual machine inside the memory of Firecracker (the VMM), and the VMM code now also starts a DPDK secondary in order to communicate with the primary. The primary sends/receives packets on the QEMU NIC.

So if by host we mean "physical machine", then the answer is no: DPDK is not on the physical machine.
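
For completeness, a DPDK primary/secondary pair is typically launched with matching EAL multi-process flags along these lines (a sketch; the binary names and core assignments here are made up):

```shell
# Primary: owns the NIC and the shared rings (core 1, hypothetical binary name)
sudo ./server -l 1 -n 4 --proc-type=primary --file-prefix=fc &

# Secondary (embedded in the VMM): attaches to the primary's hugepage memory.
# --file-prefix must match the primary's so the processes find each other.
sudo ./vmm -l 2 -n 4 --proc-type=secondary --file-prefix=fc
```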

4. You mention you run this inside QEMU. Does this mean that, using a VM, you are running DPDK and the VMM with another guest OS?

Yes, both DPDK and the Virtual Machine Manager are run inside a guest OS, which is Ubuntu 20.04. I am literally running a VMM inside a QEMU virtual machine.

5. Since you are using QEMU, please share information about VCPU pinning and memory backing on the host.

This is the physical memory I have, about 8 GB:

             total        used        free      shared  buff/cache   available
Mem:          7,6Gi       4,8Gi       955Mi       136Mi       1,9Gi       2,5Gi
Swap:          10Gi          0B        10Gi

And the lscpu output on the physical machine.

About the VCPU pinning: I did not isolate any core on the physical machine.


Solution

  • [This answer is based on live debugging and the configuration changes made to improve performance.]

    The factors that were affecting performance for both the kernel (TAP) and DPDK interfaces were:

    1. Host CPUs were not isolated for the VM
    2. KVM-QEMU VCPU threads were not pinned
    3. QEMU was not using hugepage-backed memory
    4. The emulator and I/O threads of QEMU were not pinned
    5. Inside the VM, the hugepage size in the kernel boot parameters was set to 1 GB, which was causing TLB misses on the host

    Corrected configuration:

    1. Set up the host with 4 x 1 GB hugepages
    2. Edit the QEMU XML to pin the VCPU, iothread, and emulator threads to the desired host CPUs
    3. Edit QEMU to use the host's 4 x 1 GB pages
    4. Edit the VM's grub configuration to isolate CPUs and use 2 MB pages
    5. Run the DPDK application on an isolated core in the VM
    6. taskset the Firecracker thread on Ubuntu
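
    The steps above can be sketched as commands (the paths, core numbers, and libvirt domain details are assumptions; adapt them to your setup):

```shell
# 1. Host: reserve 4 x 1 GB hugepages on the kernel command line, then reboot.
#    In /etc/default/grub:
#      GRUB_CMDLINE_LINUX_DEFAULT="... default_hugepagesz=1G hugepagesz=1G hugepages=4"
sudo update-grub

# 2-3. QEMU/libvirt: pin VCPUs, iothreads, and the emulator, and back the guest
#      with the host's 1 GB pages (virsh edit <domain>):
#   <memoryBacking><hugepages><page size="1" unit="GiB"/></hugepages></memoryBacking>
#   <cputune>
#     <vcpupin vcpu="0" cpuset="2"/>
#     <emulatorpin cpuset="4"/>
#     <iothreadpin iothread="1" cpuset="5"/>
#   </cputune>

# 4. Guest grub: isolate the DPDK cores and switch to 2 MB pages
#      GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=1,2 hugepagesz=2M hugepages=1024"

# 5. Run the DPDK application on an isolated guest core
sudo ./server -l 1 -n 4

# 6. Pin the Firecracker thread to the other isolated core
sudo taskset -cp 2 "$(pgrep -f firecracker)"
```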

    We were able to achieve around 3.5x to 4x the performance with the current DPDK code.

    Note: there is also a lot of room to improve code performance in the DPDK primary and secondary applications themselves.