network-programming, hardware, performance-testing, low-latency, infiniband

InfiniBand network performance


I am measuring the performance of InfiniBand using iperf.

It's a one-to-one connection between a server and a client.

I measured the bandwidth while varying the number of threads issuing network I/O requests.

(The cluster server has:

  • "Mellanox ConnectX-3 FDR VPI IB/E Adapter for System x" and
  • "Infiniband 40 Gb Ethernet / FDR InfiniBand")

Bandwidth:

 1 thread  : 1.34 GB/sec,
 2 threads : 1.55 GB/sec ~ 1.75 GB/sec,
 4 threads : 2.38 GB/sec,
 8 threads : 2.03 GB/sec,
16 threads : 2.00 GB/sec,
32 threads : 1.83 GB/sec.

As you can see above, the bandwidth goes up until 4 threads and decreases after that.
Could you give me some ideas to help me understand what's happening there?

Additionally, what happens when many machines send data to one machine (contention)?
Can InfiniBand handle that too?


Solution

  • There are a lot of things going on under the covers here, but one of the biggest bottlenecks in InfiniBand is the QP cache in the firmware.

    The firmware has a very small QP cache (on the order of 16-32 entries, depending on which adapter you are using). When the number of active QPs exceeds this cache, any benefit of using IB starts to degrade. From what I know, the performance penalty for a cache miss is on the order of milliseconds. Yes, that's right: milliseconds.

    There are many other caches involved.

    IB has multiple transports, the two most common being:

    1. RC - Reliable Connected
    2. UD - Unreliable Datagram

    Reliable Connected mode is somewhat like TCP in that it requires an explicit connection and is point-to-point between two processes. Each process allocates a QP (Queue Pair), which is similar to a socket in the Ethernet world, but a QP is a much more expensive resource than a socket for many different reasons, as the sketch below suggests.
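    To make that cost concrete, here is a minimal libibverbs sketch of what allocating a single RC QP involves. The choice of the first device, the queue depths, and the trimmed error handling are all assumptions for illustration, not a recommended setup.

    ```c
    /* Minimal sketch: what one RC QP drags in (compile with -libverbs). */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no IB device\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);         /* device context */
        if (!ctx) { fprintf(stderr, "open failed\n"); return 1; }
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                      /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0); /* completion queue */

        /* Unlike a socket, a QP needs a PD, CQs, and per-QP on-adapter
         * state -- that per-QP state is what the small firmware cache holds. */
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                         .max_send_sge = 1,  .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,           /* Reliable Connected */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        if (qp) {
            /* An RC QP is point-to-point: it must still be moved through the
             * INIT/RTR/RTS states and bound to exactly one remote QP. */
            printf("created RC QP number %u\n", qp->qp_num);
            ibv_destroy_qp(qp);
        }
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }
    ```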

    UD: Unreliable Datagram mode is like UDP in that it does not need a connection. A single UD QP can talk to any number of remote UD QPs (see the sketch after this paragraph).
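    As a sketch of that one-to-many addressing: with UD, each destination is named by an address handle plus a remote QP number on the work request, not by the QP itself. The peer's LID, QP number, and the shared QKey below are placeholders you would exchange out-of-band (for example over TCP).

    ```c
    /* Sketch: sending from one UD QP to any peer via a per-peer address handle. */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    #define QKEY 0x11111111u   /* placeholder queue key shared by all peers */

    int send_to_peer(struct ibv_pd *pd, struct ibv_qp *ud_qp,
                     uint16_t peer_lid, uint32_t peer_qpn, struct ibv_sge *sge)
    {
        /* The address handle, not the QP, identifies the destination. */
        struct ibv_ah_attr ah_attr = { .dlid = peer_lid, .port_num = 1 };
        struct ibv_ah *ah = ibv_create_ah(pd, &ah_attr);
        if (!ah)
            return -1;

        struct ibv_send_wr wr, *bad;
        memset(&wr, 0, sizeof(wr));
        wr.opcode            = IBV_WR_SEND;
        wr.sg_list           = sge;
        wr.num_sge           = 1;
        wr.wr.ud.ah          = ah;         /* which peer */
        wr.wr.ud.remote_qpn  = peer_qpn;   /* which QP on that peer */
        wr.wr.ud.remote_qkey = QKEY;

        /* The same local UD QP is reused for every peer; only the
         * AH and remote QPN on the work request change. */
        return ibv_post_send(ud_qp, &wr, &bad);
    }
    ```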

    If your data model is one-to-many (i.e., one machine talking to many machines) and you need a reliable connection with huge data sizes, then you are out of luck; IB starts losing some of its effectiveness.

    If you have the resources to build a reliable layer on top, then use UD to get scalability; a rough sketch of the bookkeeping that involves follows.
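    As a rough illustration only (every name here is invented), the core state such a reliability layer over UD needs is sequence numbers on the wire plus per-peer retransmit tracking; a real implementation would also handle segmentation, since UD messages are capped at the path MTU.

    ```c
    /* Invented-for-illustration sketch of reliability state over UD. */
    #include <stdint.h>
    #include <stdio.h>

    struct rel_hdr {        /* prepended to every UD payload */
        uint32_t seq;       /* sender's sequence number */
        uint32_t ack;       /* highest in-order seq received from the peer */
    };

    struct peer_state {     /* kept per remote UD QP */
        uint32_t next_seq;  /* next sequence number to assign */
        uint32_t acked;     /* peer has acknowledged everything below this */
    };

    int main(void)
    {
        struct peer_state p = { .next_seq = 0, .acked = 0 };

        /* Sender: stamp each outgoing message and buffer a copy for resend. */
        struct rel_hdr out = { .seq = p.next_seq++, .ack = 0 };
        printf("sent seq %u\n", out.seq);

        /* On an incoming ack, free buffered copies up to hdr.ack; on a
         * retransmit timeout, resend everything in [acked, next_seq). */
        struct rel_hdr in = { .seq = 0, .ack = 1 };
        if (in.ack > p.acked)
            p.acked = in.ack;
        printf("peer acked up to %u\n", p.acked);
        return 0;
    }
    ```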

    If your data model is one-to-many but the many remote processes reside on the same machine, then you can use RDS (Reliable Datagram Sockets), which is a socket interface to InfiniBand that multiplexes many connections over a single RC connection between two machines. (RDS has its own set of weird issues, but it's a start.)
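    For reference, here is a minimal sketch of the RDS socket interface on Linux. The addresses and port are placeholders, and the AF_RDS fallback definition is only needed if your headers predate it.

    ```c
    /* Sketch: datagram-style sends over RDS; the kernel multiplexes them
     * over one RC connection per pair of hosts. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    #ifndef AF_RDS
    #define AF_RDS 21          /* Linux value, for older headers */
    #endif

    int main(void)
    {
        int fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
        if (fd < 0) { perror("socket(AF_RDS)"); return 1; }

        /* RDS sockets bind to a local IP/port much like UDP sockets... */
        struct sockaddr_in local = { .sin_family = AF_INET,
                                     .sin_port   = htons(18634) };
        inet_pton(AF_INET, "192.168.1.10", &local.sin_addr);    /* placeholder */
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
            perror("bind");

        /* ...and each sendto() names its destination, while all traffic to
         * the same remote host shares a single underlying RC connection. */
        struct sockaddr_in peer = { .sin_family = AF_INET,
                                    .sin_port   = htons(18634) };
        inet_pton(AF_INET, "192.168.1.11", &peer.sin_addr);     /* placeholder */
        const char msg[] = "hello over RDS";
        if (sendto(fd, msg, sizeof(msg), 0,
                   (struct sockaddr *)&peer, sizeof(peer)) < 0)
            perror("sendto");

        close(fd);
        return 0;
    }
    ```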

    There is a third, newer transport called XRC (eXtended Reliable Connected) that mitigates some of these scalability issues as well, but it has its own caveats.