I just read an article that explains the zero-copy mechanism.
It talks about the difference between zero-copy with and without Scatter/Gather supports.
NIC without SG support, the data copies are as follows
NIC with SG support, the data copies are as follows
In a word, zero-copy with SG support can eliminate one CPU copy.
My question is that why data in kernel buffer could be scattered?
Because the Linux kernel's mapping / memory allocation facilities by default will create virtually-contiguous but possibly physically-disjoint memory regions.
That means the read from the filesystem which sendfile()
does internally goes to a buffer in kernel virtual memory, which the DMA code has to "transmogrify" (for lack of a better word) into something that the network card's DMA engine can grok.
Since DMA (often but not always) uses physical addresses, that means you either duplicate the data buffer (into a specially-allocated physically-contigous region of memory, your socket buffer above), or else transfer it one-physical-page-at-a-time.
If your DMA engine, on the other hand, is capable of aggregating multiple physically-disjoint memory regions into a single data transfer (that's called "scatter-gather") then instead of copying the buffer, you can simply pass a list of physical addresses (pointing to physically-contigous sub-segments of the kernel buffer, that's your aggregate descriptors above) and you no longer need to start a separate DMA transfer for each physical page. This is usually faster, but whether it can be done or not depends on the capabilities of the DMA engine.