
How to avoid wasting huge amounts of memory when processing DPDK mbufs?


DPDK: 22.03, PMD: Amazon ENA

We have a DPDK application that only calls rte_eth_rx_burst() (we do not transmit packets) and it must process the payload very quickly. The payload of a single network packet MUST be in contiguous memory.

The DPDK API is optimized around memory pools of fixed-size mbufs. If a packet received on the DPDK port is larger than the mbuf size but smaller than the max MTU, it will be segmented as shown in the figure below:

[Figure: a packet larger than the mbuf size is split across a chain of mbuf segments]
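For context, the data room of every mbuf in a pool is fixed at pool-creation time; a minimal sketch (pool name, counts, and sizes are illustrative, not from the question):

```c
#include <rte_mbuf.h>
#include <rte_lcore.h>

/* Every mbuf in this pool has the same fixed data room; a received packet
 * larger than that data room arrives as a chain of segments (nb_segs > 1). */
static struct rte_mempool *
create_rx_pool(void)
{
        return rte_pktmbuf_pool_create(
                "rx_pool",                 /* illustrative name */
                8192,                      /* number of mbufs */
                256,                       /* per-lcore cache size */
                0,                         /* application private area size */
                RTE_MBUF_DEFAULT_BUF_SIZE, /* data room: 2048 B + headroom */
                rte_socket_id());
}
```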

This leads us to the following problems:

  • If we configure the memory pool to store large packets (for example max MTU size), then we will always store the payload in contiguous memory, but we will waste huge amounts of memory when we receive traffic consisting of small packets. Imagine that our mbuf size is 9216 bytes but we are mostly receiving packets of 100-300 bytes: we are wasting memory by a factor of roughly 90!

  • If we reduce the size of the mbufs to, let's say, 512 bytes, then we need special handling of those segments in order to store the payload in contiguous memory. Special handling and copying hurt our performance, so they should be limited.

My final question:

  1. What strategy is recommended for a DPDK application that needs to process packet payloads in contiguous memory, for both small (100-300 byte) and large (9216 byte) packets, without wasting huge amounts of memory on 9K-sized mbuf pools? Is copying segmented jumbo frames into a larger mbuf the only option?

Solution

  • There are a couple of ways, using HW and/or SW logic, to make use of mempools with multiple object sizes.

    via hardware:

    1. If the NIC PMD supports packet or metadata (RX descriptor) parsing, one can use RTE_FLOW RAW to steer the flow to a specific queue, where each queue can be set up with its desired rte_mempool.
    2. If the NIC PMD does not support parsing of metadata (RX descriptors) but the user is aware of specific protocol fields such as ETH + MPLS|VLAN, ETH + IP + UDP, or ETH + IP + UDP + tunnel (Geneve|VXLAN), one can use RTE_FLOW to distribute that traffic over specific queues (which use a larger mempool object size), letting default traffic fall on queue 0 (which uses a smaller mempool object size); see the sketch after this list.
    3. If the hardware supports flow bifurcation, one can set up RTE_FLOW rules with raw or tunnel headers to redirect traffic to a VF, so that the PF can use a mempool with smaller objects and the VF a mempool with larger objects.
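    As an illustration of option 2 (assuming the PMD supports the ETH/IPV4/UDP pattern and the QUEUE action; the queue index and the rule itself are illustrative), a rule that steers IPv4/UDP traffic to a queue backed by the large-object mempool might look like:

```c
#include <string.h>
#include <rte_flow.h>

/* Sketch: steer all IPv4/UDP traffic to RX queue 1 (set up with the
 * large-object mempool); everything else stays on queue 0. */
static struct rte_flow *
steer_udp_to_queue1(uint16_t port_id, struct rte_flow_error *err)
{
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_action_queue queue = { .index = 1 };
        struct rte_flow_item pattern[4];
        struct rte_flow_action actions[2];

        memset(pattern, 0, sizeof(pattern));
        pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
        pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;
        pattern[2].type = RTE_FLOW_ITEM_TYPE_UDP;
        pattern[3].type = RTE_FLOW_ITEM_TYPE_END;

        memset(actions, 0, sizeof(actions));
        actions[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;
        actions[0].conf = &queue;
        actions[1].type = RTE_FLOW_ACTION_TYPE_END;

        return rte_flow_create(port_id, &attr, pattern, actions, err);
}
```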

    via software (if HW support is absent or limited):

    1. Using an RX callback (rte_rx_callback_fn), check mbuf->nb_segs > 1 to confirm that multiple segments are present, allocate an mbuf from the larger mempool, attach it as the first segment, and then invoke rte_pktmbuf_linearize to move the content into that first buffer.
    2. Set up all queues with the large-object mempool. In an RX callback, check mbuf->pkt_len < [threshold size]; if so, allocate an mbuf from the smaller pool, memcpy the content (packet data and the necessary metadata), swap the original mbuf with the new one, and free the original (see the sketch after this list).
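    A minimal sketch of SW-2 (the threshold, the callback name, and the exact metadata fields copied are illustrative; error handling is reduced to keeping the original mbuf on allocation failure):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_memcpy.h>

#define SMALL_PKT_THRESHOLD 512   /* illustrative threshold */

/* SW-2 sketch: every queue is fed from the large-object pool; packets below
 * the threshold are copied into small mbufs and the originals are freed. */
static uint16_t
rx_swap_small_cb(uint16_t port __rte_unused, uint16_t queue __rte_unused,
                 struct rte_mbuf *pkts[], uint16_t nb_pkts,
                 uint16_t max_pkts __rte_unused, void *user_param)
{
        struct rte_mempool *small_pool = user_param;

        for (uint16_t i = 0; i < nb_pkts; i++) {
                struct rte_mbuf *m = pkts[i];

                if (m->pkt_len >= SMALL_PKT_THRESHOLD)
                        continue;

                struct rte_mbuf *c = rte_pktmbuf_alloc(small_pool);
                if (c == NULL)
                        continue;               /* keep the original on failure */

                /* a small packet fits in one segment, so one memcpy suffices */
                rte_memcpy(rte_pktmbuf_mtod(c, void *),
                           rte_pktmbuf_mtod(m, const void *), m->pkt_len);
                c->data_len    = m->data_len;
                c->pkt_len     = m->pkt_len;
                c->port        = m->port;
                c->ol_flags    = m->ol_flags;
                c->packet_type = m->packet_type;
                c->hash        = m->hash;       /* RSS hash, if used */

                pkts[i] = c;
                rte_pktmbuf_free(m);            /* or defer, see the note below */
        }
        return nb_pkts;
}

/* registration, once per RX queue, e.g.:
 * rte_eth_add_rx_callback(port_id, queue_id, rx_swap_small_cb, small_pool); */
```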

    Pros and Cons:

    • SW-1: this is a costly process, since accessing multiple segments means non-contiguous memory, and it is done for larger payloads such as 2K to 9K. The NIC hardware also has to support RX scatter (multi-segment receive).

    • SW-2: this is less expensive than SW-1. As there are no multiple segments, the copy cost can be amortized with rte_pktmbuf_mtod and prefetching of the payload.

    Note: in both cases, the cost of freeing mbufs inside the RX callback can be reduced by maintaining a list of original mbufs and freeing them in bulk later (see the sketch below).
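    One way to realize that deferred free (the batch size and the use of rte_pktmbuf_free_bulk are my choice, not prescribed by the answer):

```c
#include <rte_mbuf.h>

#define DEFER_BATCH 256                 /* illustrative batch size */

/* One instance per RX queue/lcore; no locking is needed as long as each
 * queue's callback uses its own array. */
static struct rte_mbuf *deferred[DEFER_BATCH];
static unsigned int deferred_cnt;

/* Called from the RX callback instead of rte_pktmbuf_free(m). */
static inline void
defer_free(struct rte_mbuf *m)
{
        deferred[deferred_cnt++] = m;
        if (deferred_cnt == DEFER_BATCH) {
                rte_pktmbuf_free_bulk(deferred, deferred_cnt);
                deferred_cnt = 0;
        }
}
```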

    Alternative option-1 (involves modifying the PMD):

    • modify the PMD probe/create code to allocate mempools for both large and small objects
    • set the max elements per RX burst to 1
    • use the scalar code path only
    • change the receive function to
    1. check the packet size by reading the RX (packet) descriptor
    2. comment out the original replenish-per-threshold code
    3. allocate either a large or small mempool object based on that size.

    [edit-1] Based on the comment, the DPDK version is 22.03 and the PMD is Amazon ENA. The DPDK NIC feature summary and the ENA PMD point to:

    1. No RTE_FLOW RSS to specific queues.
    2. No RTE_FLOW_RAW for packet size.
    3. In the function ena_rx_queue_setup, support for an individual rte_mempool per RX queue.

    Hence the current options are:

    1. Modify the ENA PMD to support multiple mempool sizes.
    2. Use SW-2 in an RX callback to copy smaller payloads into new mbufs and swap them out.

    Note: there is an alternate approach:

    1. create an empty pool backed by an external mempool, and
    2. use a modified ENA PMD to hand out pool objects either as single small buffers or as multiple contiguous pool objects.

    Recommendation: use a PMD or programmable NIC that can bifurcate based on packet size and then use RTE_FLOW to steer to a specific queue. To allow multiple CPUs to process multiple flows, set up queue 0 as the default for small packets and the other queues with RTE_FLOW RSS, each with its specific mempool (see the sketch below).
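    A sketch of the per-queue mempool part of that recommendation (pool handles, queue count, and descriptor count are illustrative; the RTE_FLOW/RSS rules still have to steer large packets to queues 1..N):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Queue 0: small-object pool for default/small packets.
 * Queues 1..nb_rxq-1: large-object pool for traffic steered there by RTE_FLOW. */
static int
setup_rx_queues(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_desc,
                struct rte_mempool *small_pool, struct rte_mempool *large_pool)
{
        int socket = rte_eth_dev_socket_id(port_id);

        for (uint16_t q = 0; q < nb_rxq; q++) {
                struct rte_mempool *mp = (q == 0) ? small_pool : large_pool;
                int ret = rte_eth_rx_queue_setup(port_id, q, nb_desc,
                                                 socket, NULL, mp);
                if (ret != 0)
                        return ret;
        }
        return 0;
}
```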