Currently, I work for a mid-sized high-performance computing (HPC) shop. For many of the scientific codes we run, communication performance matters - both in terms of inter-machine (a.k.a., inter-node) bandwidth and latency. Like most HPC shops, we have some experience with Infiniband, but in recent years we've been using 10 Gbps Ethernet (10gigE) for a cluster interconnect. Given ethernet's prevalence, and general dominance in datacenter networking, 10gigE seems on the surface to be a general win, and a decent choice for a cluster interconnect (particularly for a user base that historically prefers gigabit ethernet for cost reasons).
I've designed three 10gigE clusters, two of which are on the current (November, 2011) Top 500 list. I do not recommend this. 10gigE has its place, but currently economics favor Infiniband for high-performance computing. If your code uses MPI, and you need more cores than you can fit in one compute node (and your code isn't embarrassingly parallel - I've seen some that could operate nicely over 10 Mbps ethernet), you should be looking at Infiniband.
Rather than delving into why I've been building 10gigE clusters, this page discusses modern technology that can help you get the most performance from a high-speed ethernet fabric. Be warned, the content from here on out gets technical quickly. I've likely spent more time than is healthy examining this space, and doing so requires a fair amount of expertise in TCP, IP, ethernet, Infiniband (as well as general RDMA theory, and its multiple incarnations), operating systems, MPI libraries, and several vendors' product lines.
To quote the xterm source code: "There be dragons here."
TCP/IP is great, for most things - but the API pretty much requires kernel
intervention. Your app calls
some library fires off a syscall, and the kernel starts formatting data to
go over the wire. Under Linux, a null syscall has an overhead of around 1000
instructions (if you'll pardon the blind assertion), so that means you can
do around 2.5 million syscalls per second on a 2.5 GHz CPU (using some vague
hand-waving to avoid calculating effects of load-store queuing and superscalar
processors). On paper, that means a hard max of around 30 Gbps of throughput -
more, with frame sizes over 1500 bytes.
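For anyone who wants to check the arithmetic, here is a rough sketch of that paper estimate. It assumes one instruction retired per clock cycle and exactly one full frame sent per syscall, which is the same hand-waving as above. The function name and the 9000-byte jumbo frame size are my own choices for illustration.

```python
# Back-of-the-envelope ceiling on syscall-driven throughput.
# Assumptions: one instruction per clock cycle, one frame per syscall.
CPU_HZ = 2.5e9                   # 2.5 GHz CPU
INSTRUCTIONS_PER_SYSCALL = 1000  # rough null-syscall overhead under Linux
FRAME_BYTES = 1500               # standard Ethernet payload size


def syscall_bound_gbps(cpu_hz=CPU_HZ,
                       instr_per_syscall=INSTRUCTIONS_PER_SYSCALL,
                       frame_bytes=FRAME_BYTES):
    """Upper bound on throughput (Gbps) if syscall entry were the only cost."""
    syscalls_per_sec = cpu_hz / instr_per_syscall
    return syscalls_per_sec * frame_bytes * 8 / 1e9


print(syscall_bound_gbps())                  # 30.0
print(syscall_bound_gbps(frame_bytes=9000))  # 180.0 (assumed jumbo frame size)
```

The number is only a ceiling: it ignores every cost other than getting into the kernel.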
Unfortunately, that's not reality. First off, a processor will need to do some data formatting and copying beyond the time to enter the syscall. Second, data arriving will also trigger syscalls. Some of this can be ameliorated (e.g., jumbo frames, interrupt coalescing, etc.) but at a cost of tying up a processor to handle the kernel's side of the communication. If your application requires frequent data exchange (like most HPC simulations), the added latency and processor overhead can greatly degrade performance - even without fully utilizing the available bandwidth.
TOE (TCP Offload Engine) NICs may help, to a limited degree. A TOE will reduce the CPU's workload, but won't significantly reduce overall message latency - unless the TOE vendor supplies a wrapper library to replace the sockets API (Solarflare does this, for example).
If you need to do RDMA over Ethernet, iWARP is the easiest way to do it. It's not quite Infiniband, but many of the various IB-related commands in OFED will work. Many RDMA apps will work with this, and as iWARP is encapsulated by TCP/IP it can transit a router. Latency will be higher than RoCE (at least with both Chelsio and Intel/NetEffect implementations), but still well under 10 μs. iWARP is reasonably stable with recent versions of the OpenFabrics stack - in-kernel drivers may not be as stable (including those baked into Redhat Enterprise 5 and 6). Caveat emptor.
RoCE is RDMA over Converged Ethernet - but Infiniband over Ethernet would be a more apt description. Strip the GUIDs out of the IB header, replace them with Ethernet MAC addresses, and send it over the wire. As of this writing, only Mellanox (www.mellanox.com) makes RoCE-capable equipment (their CX2 and CX3 line of products).
Infiniband is a lossless physical-layer protocol, so RoCE requires lossless Ethernet. Also, since it's Ethernet, RoCE cannot transit a router. It's strictly a layer-2 protocol, and it needs a complicated layer-2.
Ethernet becomes lossless by re-using 802.3x PAUSE frames for explicit flow control. This is timing-sensitive; a receiver must send a PAUSE soon enough such that it is received and processed before the receive buffer can fill. Obviously, there are issues stretching this over some distance. Switches must be internally lossless, and must be able to send PAUSE frames as well as receive them. Such switches are usually marketed with acronyms like "DCB" (DataCenter Bridging) or "CEE" (Converged Enhanced Ethernet).
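To put rough numbers on the distance problem, here is a small sketch of how much receive-buffer headroom a PAUSE needs. The propagation delay of about 5 ns per metre is an assumption, as is allowing for one maximum-size frame already in transmission on each end. It ignores processing time inside the NIC or switch, so real headroom would be larger. The function names are mine.

```python
LINK_BPS = 10e9             # 10gigE
PROPAGATION_NS_PER_M = 5.0  # assumption: roughly 5 ns per metre of cable
PAUSE_QUANTUM_BITS = 512    # a PAUSE time is counted in units of 512 bit times


def headroom_bytes(cable_m, link_bps=LINK_BPS, max_frame_bytes=1500):
    """Bytes that can still arrive after a receiver decides to send PAUSE."""
    round_trip_ns = 2 * cable_m * PROPAGATION_NS_PER_M
    wire_bytes = link_bps * round_trip_ns / 8e9
    # Assumption: the PAUSE waits behind one frame already being sent, and
    # the far end finishes one frame it had already started.
    return wire_bytes + 2 * max_frame_bytes


def max_pause_seconds(link_bps=LINK_BPS):
    """Longest pause a single frame can request (16-bit quanta field)."""
    return 0xFFFF * PAUSE_QUANTUM_BITS / link_bps


print(headroom_bytes(100))     # in-rack or in-row cable run
print(headroom_bytes(10_000))  # 10 km campus link
print(max_pause_seconds())     # a few milliseconds at 10 Gbps
```

Under these assumptions the data in flight grows linearly with cable length, which is why lossless Ethernet is much easier inside one room than across a campus.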
Obviously, this coarse-grained approach will pause all traffic over the link - including any IP or FCoE traffic. As this can have a negative impact on non-RoCE performance, Cisco has proposed Priority Flow Control (PFC, now covered under IEEE 802.1Qbb). This is a PAUSE frame with a special payload, indicating which Ethernet QoS class should be paused. This is accompanied by other protocols, to negotiate QoS values on either end of a link (i.e., between NIC and switch).
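As an illustration of that "special payload", here is a minimal sketch that builds a PFC frame by hand. It uses the MAC Control EtherType (0x8808), the reserved destination address 01:80:C2:00:00:01, and the PFC opcode (0x0101), followed by a class-enable bitmap and eight per-class pause times. The source MAC address and the choice of priority 4 are made up for the example, and the function name is mine.

```python
import struct

MAC_CONTROL_DST = bytes.fromhex("0180c2000001")  # reserved MAC Control address
MAC_CONTROL_ETHERTYPE = 0x8808
OPCODE_PAUSE = 0x0001  # classic PAUSE: pauses the whole link
OPCODE_PFC = 0x0101    # Priority Flow Control: pauses selected classes


def build_pfc_frame(src_mac, pause_quanta):
    """Build a PFC frame (without FCS).

    pause_quanta maps a priority (0-7) to a pause time in 512-bit-time quanta.
    """
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    enable_vector = 0
    times = [0] * 8
    for prio, quanta in pause_quanta.items():
        if not 0 <= prio <= 7:
            raise ValueError("priority must be 0-7")
        if not 0 <= quanta <= 0xFFFF:
            raise ValueError("pause time must fit in 16 bits")
        enable_vector |= 1 << prio  # mark this class as affected
        times[prio] = quanta
    body = struct.pack("!HHH8H", MAC_CONTROL_ETHERTYPE, OPCODE_PFC,
                       enable_vector, *times)
    frame = MAC_CONTROL_DST + src_mac + body
    return frame.ljust(60, b"\x00")  # pad to the minimum frame size


# Hypothetical NIC address; pause only priority 4 for the maximum time.
frame = build_pfc_frame(bytes.fromhex("020000000001"), {4: 0xFFFF})
print(frame.hex())
```

A pause time of zero for an enabled class acts as "resume", so the same frame layout is used to both stop and restart a traffic class.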
Finally, all types of traffic on the link will have different Ethernet frame types (as described by IANA). IPv4, IPv6, FCoE, and RoCE all have different ID values.
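For reference, here is a short sketch that classifies a raw Ethernet frame by its frame type (EtherType) field, using the values registered for the four protocols mentioned. The function name and the handling of a single 802.1Q VLAN tag are my own additions for illustration.

```python
import struct

ETHERTYPES = {
    0x0800: "IPv4",
    0x86DD: "IPv6",
    0x8906: "FCoE",
    0x8915: "RoCE",
}
VLAN_TAG = 0x8100  # 802.1Q tag; the real EtherType follows 4 bytes later


def classify(frame):
    """Return the protocol name carried by a raw Ethernet frame."""
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == VLAN_TAG:
        # Skip the 2-byte tag control field (priority bits and VLAN ID).
        (ethertype,) = struct.unpack_from("!H", frame, 16)
    return ETHERTYPES.get(ethertype, "other (0x%04x)" % ethertype)


untagged_roce = bytes(12) + b"\x89\x15" + bytes(46)
tagged_ipv4 = bytes(12) + b"\x81\x00" + b"\x80\x0a" + b"\x08\x00" + bytes(42)
print(classify(untagged_roce))  # RoCE
print(classify(tagged_ipv4))    # IPv4
```

This is the field a switch can match on if it classifies RoCE traffic by frame type rather than by QoS value.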
While RoCE is supported by OFED, as of OFED 1.5.3 it isn't completely stable. You'll want to use Mellanox's OFED - version 1.5.3 or higher. Stock OFED will work fine for small tests, but large applications will have a tendency to crash.
PFC is a pain. The tools to auto-negotiate may not exist for RoCE - the only documentation I've found was limited to FCoE. Avoid it if at all possible.
Somehow, you'll need to classify RoCE traffic as lossless. Here are some suggestions, in my order of preference:
options rdma_cm def_prec2sl=4 in /etc/modprobe.d (Obviously, I'm using the value
/etc/mv2.conf, so MVAPICH2 will know what IP address to try for RoCE connections
If you can, stop reading and go buy some Infiniband adapters. You'll save a considerable amount of staff time.
Fine. Keep reading. But don't say I didn't warn you.
The Nexus 5000-series and the Nexus 7000-series switches are completely different products. The interface to building lossless queues is different, the command syntax is different, and different values can be used for lossless traffic classes on each series of switches. If you have environments with both, you'll be picking different QoS values.
The Nexus 7000 platform only supports lossless queuing on the newest "F" boards - the fabric boards that have no routing abilities. You'll want to buy those, if you plan on having stable RoCE.
Finally, be wary of ANY firmware updates. We've had a functional RoCE configuration on a Nexus 7000 switch, using firmware 5.1(3), using the third method above. That broke, however, when we upgraded to 5.1(5). Something changed in the default queuing config, and since you can only build on the default lossless queue config (rather than nuke it and define your own), you are subject to changes in the default. In our case, RoCE performance dropped to 30 Mbps (down from 9.91 Gbps). All wasn't lost, though - after the upgrade, all traffic was lossless (except what we'd previously tagged via QoS, of course). We just stopped using QoS, and now have reliable Ethernet. Absolutely bizarre.
Making this work depends on how RoCE traffic was classified. If RoCE Ethertypes are lossless, or if all traffic is lossless (options #1 or #2, above) any RDMA application should just work - the RoCE adapter presents as an Infiniband HCA.
If you picked option #3, you'll need to jump through some extra hoops. First,
def_prec2sl module parameter and
as described above. At this point, MVAPICH2 applications should work. For
OpenMPI, you'll need to use OpenMPI 1.4.4 or 1.5.4 or newer. They need
additional command-line options to set the IB service level and the IP address:
-mca btl_openib_ib_service_level <number> and
-mca btl_openib_ipaddr_include <ipaddr>, respectively.
These can be baked into a config file (like
in your OpenMPI's
share directory). Note that
btl_openib_ipaddr_include can take CIDR notation for a subnet to
match, so you can use the same config file for all nodes in a cluster.
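As a sketch of how those two options fit onto a command line, here is a small helper that builds the mpirun arguments. The service level of 4, the 192.168.10.0/24 subnet, the process count, and the ./my_app binary are all placeholders I made up; substitute whatever your fabric actually uses. The helper name is mine too.

```python
import ipaddress


def roce_mca_args(service_level, subnet):
    """Return the extra OpenMPI arguments for RoCE with a custom service level."""
    if not 0 <= service_level <= 15:
        raise ValueError("IB service levels are 4-bit values (0-15)")
    # Validate the CIDR string; raises ValueError if host bits are set.
    network = ipaddress.ip_network(subnet, strict=True)
    return [
        "-mca", "btl_openib_ib_service_level", str(service_level),
        "-mca", "btl_openib_ipaddr_include", str(network),
    ]


# Hypothetical values: SL 4, a /24 holding every node's RoCE interface.
cmd = ["mpirun", "-np", "64"] + roce_mca_args(4, "192.168.10.0/24") + ["./my_app"]
print(" ".join(cmd))
```

Because the subnet is given in CIDR form, the same argument list works unchanged on every node.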
In theory, it may be possible to use RoCE for non-MPI applications - including kernel-level things like Lustre. I'd only attempt this if options #1 or #2 are in use, though - setting extra VLANs, non-default GIDs, and custom IB service levels (mapped to Ethernet QoSes) is likely to be hard to integrate in anything other than OpenMPI and MVAPICH2.
There isn't a lot of documentation (practically zero, outside of Mellanox) on RoCE. Any useful links I can find will be added here.