DPDK: 22.03
PMD: Amazon ENA
We have a DPDK application that only calls rte_eth_rx_burst()
(we do not transmit packets) and it must process the payload very quickly. The payload of a single network packet MUST be in contiguous memory.
The DPDK API is optimized around memory pools of fixed-size mbufs. If a packet received on the DPDK port is larger than the mbuf size but smaller than the max MTU, it will be segmented into a chain of mbufs, as in the sketch below.
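As an illustrative sketch of the problem (not part of any API), a consumer that needs contiguous payload would otherwise have to walk the segment chain, since each segment holds only part of the payload:

```c
#include <rte_mbuf.h>

/* Sketch: walking a multi-segment packet. Each segment's data sits in a
 * separate buffer, so a consumer needing contiguous payload must either
 * copy the segments together or be rewritten to work piecewise. */
static void
walk_segments(const struct rte_mbuf *m)
{
	uint32_t offset = 0;

	for (const struct rte_mbuf *seg = m; seg != NULL; seg = seg->next) {
		const char *part = rte_pktmbuf_mtod(seg, const char *);
		/* payload bytes [offset, offset + seg->data_len) live at 'part' */
		(void)part;
		offset += seg->data_len;
	}
	(void)offset;
}
```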
This leads us to the following problems:
1. If we configure the memory pool to store large packets (for example, max MTU size), then we will always store the payload in contiguous memory, but we will waste huge amounts of memory whenever we receive traffic containing small packets. Imagine that our mbuf size is 9216 bytes but we are receiving mostly packets of 100-300 bytes: we are wasting memory by a factor of 90!
2. If we reduce the size of the mbufs, to let's say 512 bytes, then we need special handling of those segments in order to store the payload in contiguous memory. Special handling and copying hurts our performance, so it should be limited.
My final question: is there a way to use mempools of multiple sizes, so that small and large packets each end up in contiguous memory without the waste or the copying described above?
There are a couple of ways, involving HW and SW logic, to make use of multiple-size mempools.
via hardware:

1. If the NIC supports RAW flow rules, one can program the flow direction to a specific queue, where each queue can be set up with its desired rte_mempool.
2. With patterns such as ETH + MPLS|VLAN, ETH + IP + UDP, or ETH + IP + UDP + Tunnel (Geneve|VxLAN), one can use RTE_FLOW to distribute the traffic over specific queues (which have a larger mempool object size), thus making default traffic fall on queue-0 (which has a smaller mempool object size). A sketch follows this list.
3. If flow bifurcation is available, one can set up RTE_FLOW with raw or tunnel headers to redirect selected traffic to a VF; thus the PF can make use of the smaller-object mempool and the VF can make use of the larger-object mempool.
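As a minimal sketch of option 2 (assuming a NIC whose PMD implements the Flow API; the pattern and queue index are illustrative), UDP-over-IPv4 traffic is steered to queue 1, which was set up with the large-object mempool, while everything else defaults to queue 0:

```c
#include <rte_flow.h>

/* Illustrative sketch: steer all UDP-over-IPv4 traffic to RX queue 1
 * (backed by the large-object mempool); everything else stays on
 * queue 0 (small-object mempool). Error handling is trimmed. */
static struct rte_flow *
steer_udp_to_large_queue(uint16_t port_id)
{
	struct rte_flow_attr attr = { .ingress = 1 };

	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_UDP },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};

	struct rte_flow_action_queue queue = { .index = 1 };
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};

	struct rte_flow_error err;
	return rte_flow_create(port_id, &attr, pattern, actions, &err);
}
```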
via software (if HW support is absent or limited):

1. Using an RX callback (rte_rx_callback_fn), one can check mbuf->nb_segs > 1 to confirm multiple segments are present, then allocate an mbuf from the larger mempool, attach it as the first segment, and invoke rte_pktmbuf_linearize to move the content into that first buffer.
2. In the RX callback, check mbuf->pkt_len < [threshold size]; if it is, allocate an mbuf from the smaller pool, memcpy the content (packet data and necessary metadata), then swap the original mbuf with the new mbuf and free the original. A sketch of this option follows the pros and cons below.

Pros and Cons:
SW-1: this is a costly process, since accessing multiple segments means non-contiguous memory, and it is done for the larger payloads such as 2K to 9K. The NIC also has to support RX scatter / multi-segment receive.
SW-2: this is less expensive than SW-1. As there are no multiple segments, the cost can be amortized with rte_pktmbuf_mtod and a prefetch of the payload.
note: in both cases, the cost of rte_pktmbuf_free within the RX callback can be reduced by maintaining a list of original mbufs to free in bulk later.
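A minimal sketch of SW-2 as an RX callback; the threshold value, the pool variable, and exactly which metadata fields to copy are assumptions to be tuned per application:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_memcpy.h>

#define SMALL_PKT_THRESHOLD 512   /* assumption: tune to the traffic mix */

static struct rte_mempool *small_pool;   /* assumption: created at init */

/* Sketch of SW-2: single-segment packets below the threshold are copied
 * into mbufs from the small pool; the original (large) mbufs are freed
 * immediately here for simplicity, although batching the frees, as noted
 * above, is cheaper. */
static uint16_t
compact_small_pkts(uint16_t port, uint16_t queue, struct rte_mbuf *pkts[],
		   uint16_t nb_pkts, uint16_t max_pkts, void *user_param)
{
	(void)port; (void)queue; (void)max_pkts; (void)user_param;

	for (uint16_t i = 0; i < nb_pkts; i++) {
		struct rte_mbuf *m = pkts[i];

		if (m->nb_segs != 1 || m->pkt_len >= SMALL_PKT_THRESHOLD)
			continue;

		struct rte_mbuf *copy = rte_pktmbuf_alloc(small_pool);
		if (copy == NULL)
			continue;   /* small pool exhausted: keep original */

		rte_memcpy(rte_pktmbuf_mtod(copy, void *),
			   rte_pktmbuf_mtod(m, const void *), m->data_len);
		copy->data_len = m->data_len;
		copy->pkt_len  = m->pkt_len;
		copy->port     = m->port;
		copy->ol_flags = m->ol_flags;   /* plus any other metadata used */

		rte_pktmbuf_free(m);
		pkts[i] = copy;
	}
	return nb_pkts;
}

/* Registration, e.g. right after the queue is set up:
 *   rte_eth_add_rx_callback(port_id, queue_id, compact_small_pkts, NULL);
 */
```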
Alternative option-1 (involves modifying the PMD):

1. Modify probe or create to allocate mempools for both large and small objects.
2. Modify the recv function to pick mbufs from the appropriate mempool based on the received packet size.

[edit-1] Based on the comment update, the DPDK version is 22.03 and the PMD is Amazon ENA. Based on the DPDK NIC feature summary and the ENA PMD, ena_rx_queue_setup accepts an individual rte_mempool per RX queue. Hence the current options are the software approaches (SW-1 and SW-2) above, combined with per-queue mempools of different object sizes (see the sketch below).
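A minimal sketch of per-queue mempools; the pool names, element counts, object sizes, and descriptor counts are illustrative, not from the original post:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Illustrative sketch: two mempools with different object sizes, one per
 * RX queue. Assumes the port has already been configured with 2 RX queues. */
static int
setup_rx_queues(uint16_t port_id, int socket_id)
{
	struct rte_mempool *small_pool = rte_pktmbuf_pool_create(
		"rx_small", 8192, 256, 0,
		512 + RTE_PKTMBUF_HEADROOM, socket_id);

	struct rte_mempool *large_pool = rte_pktmbuf_pool_create(
		"rx_large", 2048, 64, 0,
		9216 + RTE_PKTMBUF_HEADROOM, socket_id);

	if (small_pool == NULL || large_pool == NULL)
		return -1;

	/* queue-0 for default (small) traffic, queue-1 for large payloads */
	if (rte_eth_rx_queue_setup(port_id, 0, 1024, socket_id, NULL, small_pool) != 0)
		return -1;
	if (rte_eth_rx_queue_setup(port_id, 1, 1024, socket_id, NULL, large_pool) != 0)
		return -1;

	return 0;
}
```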
Note: there is an alternate approach by modifying the PMD itself, as described in alternative option-1 above.
Recommendation: use a PMD or programmable NIC that can bifurcate based on packet size, and then use RTE_FLOW to steer traffic to a specific queue. To allow multiple CPUs to process multiple flows, set up queue-0 as the default for small packets and the other queues with the RTE_FLOW RSS action, each backed by its specific mempool.