RoCE - RDMA over Converged Ethernet

Mike Shuey (shuey@fmepnet.org)

A personal obsession


Introduction

Currently, I work for a mid-sized high-performance computing (HPC) shop. For many of the scientific codes we run, communication performance matters - both in terms of inter-machine (a.k.a. inter-node) bandwidth and latency. Like most HPC shops, we have some experience with Infiniband, but in recent years we've been using 10 Gbps Ethernet (10gigE) as a cluster interconnect. Given Ethernet's prevalence and general dominance in datacenter networking, 10gigE seems, on the surface, like a general win and a decent choice for a cluster interconnect (particularly for a user base that has historically preferred gigabit Ethernet for cost reasons).

I've designed three 10gigE clusters, two of which are on the current (November 2011) Top 500 list. I do not recommend this. 10gigE has its place, but economics currently favor Infiniband for high-performance computing. If your code uses MPI, and you need more cores than you can fit in one compute node (and your code isn't embarrassingly parallel - I've seen some that could run nicely over 10 Mbps Ethernet), you should be looking at Infiniband.

Rather than delving into why I've been building 10gigE clusters, this page discusses modern technology that can help you get the most performance from a high-speed Ethernet fabric. Be warned: the content from here on out gets technical quickly. I've likely spent more time than is healthy examining this space, and doing so requires a fair amount of expertise in TCP, IP, Ethernet, Infiniband (as well as general RDMA theory and its multiple incarnations), operating systems, MPI libraries, and several vendors' product lines.

To quote the xterm source code: "There be dragons here."


Defining "slow", and Why Plain TCP/IP is Bad

TCP/IP is great for most things - but the API pretty much requires kernel intervention. Your app calls socket() and write(), some library fires off a syscall, and the kernel starts formatting data to go over the wire. Under Linux, a null syscall has an overhead of around 1000 instructions (if you'll pardon the blind assertion), so you can do around 2.5 million syscalls per second on a 2.5 GHz CPU (using some vague hand-waving to avoid calculating the effects of load-store queuing and superscalar processors). On paper, that means a hard maximum of around 30 Gbps of throughput at the standard 1500-byte frame size - more, with larger frames.
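
If you'd like the arithmetic spelled out, here's a minimal sketch in C (the 1000-instruction figure and the one-instruction-per-cycle assumption are the same hand-waving as above, not measurements):

  /*
   * Back-of-the-envelope version of the numbers above.  The 1000-instruction
   * syscall cost and the one-instruction-per-cycle assumption are guesses,
   * not measured values.
   */
  #include <stdio.h>

  int main(void)
  {
      double clock_hz       = 2.5e9;    /* 2.5 GHz CPU                  */
      double insns_per_call = 1000.0;   /* rough cost of a null syscall */
      double frame_bytes    = 1500.0;   /* standard Ethernet MTU        */

      double calls_per_sec = clock_hz / insns_per_call;          /* ~2.5 million */
      double bits_per_sec  = calls_per_sec * frame_bytes * 8.0;  /* ~30 Gbps     */

      printf("syscalls/sec:       %.2e\n", calls_per_sec);
      printf("throughput ceiling: %.0f Gbps\n", bits_per_sec / 1e9);
      return 0;
  }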

Unfortunately, that's not reality. First off, a processor needs to do some data formatting and copying beyond the time spent entering the syscall. Second, arriving data triggers interrupts, and has to be consumed with still more syscalls. Some of this can be ameliorated (jumbo frames, interrupt coalescing, etc.), but at the cost of tying up a processor to handle the kernel's side of the communication. If your application requires frequent data exchange (like most HPC simulations), the added latency and processor overhead can greatly degrade performance - even without fully utilizing the available bandwidth.

TOE NICs

TOE (TCP Offload Engine) NICs may help, to a limited degree. A TOE will reduce the CPU's workload, but it won't significantly reduce overall message latency - unless the TOE vendor ships a wrapper library to replace the sockets API (Solarflare does this, for example).
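
For the curious, such wrapper libraries generally work by interposing on the sockets API from user space, typically via LD_PRELOAD. The sketch below shows only the general technique - it is not Solarflare's code, and a real user-space stack would service the socket itself rather than falling through to the kernel:

  /* Generic sockets-API interposition via LD_PRELOAD (illustrative only).
   * Build with:  gcc -shared -fPIC -o interpose.so interpose.c -ldl
   * Run with:    LD_PRELOAD=./interpose.so ./your_app
   */
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <stdio.h>

  int socket(int domain, int type, int protocol)
  {
      static int (*real_socket)(int, int, int);

      if (!real_socket)
          real_socket = (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

      fprintf(stderr, "socket(%d, %d, %d) intercepted\n", domain, type, protocol);

      /* A user-space TCP stack would create its own endpoint here; this
       * sketch just falls through to the kernel's implementation. */
      return real_socket(domain, type, protocol);
  }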

iWARP

If you need to do RDMA over Ethernet, this is the easiest way to do it. It's not quite Infiniband, but many of the IB-related commands in OFED will work. Many RDMA apps will run over it, and since iWARP is encapsulated in TCP/IP it can transit a router. Latency will be higher than RoCE (at least with the Chelsio and Intel/NetEffect implementations), but still well under 10 μs. iWARP is reasonably stable with recent versions of the OpenFabrics stack - in-kernel drivers may not be as stable (including those baked into Red Hat Enterprise Linux 5 and 6). Caveat emptor.


RoCE

RoCE is RDMA over Converged Ethernet - but "Infiniband over Ethernet" would be a more apt description. Strip the GUIDs out of the IB header, replace them with Ethernet MAC addresses, and send the packet over the wire. As of this writing, only Mellanox (www.mellanox.com) makes RoCE-capable equipment (their ConnectX-2 and ConnectX-3 product lines).
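
To make the "MAC addresses instead of GUIDs" hand-waving a bit more concrete: a RoCE port's default GID is just an IPv6-style link-local address built from the adapter's MAC address. The sketch below shows the usual EUI-64 construction - it's illustrative, not lifted from Mellanox's driver, and the example MAC is made up:

  /* Build the default RoCE GID (fe80::/64 prefix + EUI-64) from a 6-byte MAC.
   * Illustrative only - the driver does this for you. */
  #include <stdint.h>
  #include <stdio.h>

  static void mac_to_gid(const uint8_t mac[6], uint8_t gid[16])
  {
      /* fe80::/64 link-local prefix */
      gid[0] = 0xfe; gid[1] = 0x80;
      for (int i = 2; i < 8; i++)
          gid[i] = 0x00;

      /* EUI-64: flip the universal/local bit, wedge ff:fe into the middle */
      gid[8]  = mac[0] ^ 0x02;
      gid[9]  = mac[1];
      gid[10] = mac[2];
      gid[11] = 0xff;
      gid[12] = 0xfe;
      gid[13] = mac[3];
      gid[14] = mac[4];
      gid[15] = mac[5];
  }

  int main(void)
  {
      uint8_t mac[6] = { 0x00, 0x02, 0xc9, 0x12, 0x34, 0x56 };  /* made-up MAC */
      uint8_t gid[16];

      mac_to_gid(mac, gid);
      for (int i = 0; i < 16; i++)
          printf("%02x%s", gid[i], (i % 2 && i < 15) ? ":" : "");
      printf("\n");   /* prints fe80:0000:0000:0000:0202:c9ff:fe12:3456 */
      return 0;
  }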

Infiniband assumes a lossless link layer (it has credit-based flow control built in), so RoCE requires lossless Ethernet. Also, since it's plain Ethernet framing, RoCE cannot transit a router. It's strictly a layer-2 protocol, and it needs a complicated layer 2.

Lossless Ethernet: a Quick Review

Ethernet becomes lossless by re-using 802.3x PAUSE frames for explicit flow control. This is timing-sensitive; a receiver must send a PAUSE early enough that it is received and processed before the receive buffer can fill. Obviously, there are issues stretching this over any significant distance. Switches must be internally lossless, and must be able to send PAUSE frames as well as honor them. Such switches are usually marketed under acronyms like "DCB" (Data Center Bridging) or "CEE" (Converged Enhanced Ethernet).
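
For reference, a PAUSE frame is tiny: a MAC Control frame (Ethertype 0x8808) carrying opcode 0x0001 and a single timer, measured in quanta of 512 bit times. Roughly, in C (a sketch of the wire layout, ignoring struct packing):

  /* Rough on-the-wire layout of an 802.3x PAUSE frame (preamble and FCS
   * omitted; multi-byte fields are big-endian on the wire). */
  #include <stdint.h>

  struct pause_frame {
      uint8_t  dst[6];        /* always 01:80:c2:00:00:01               */
      uint8_t  src[6];        /* sender's MAC                           */
      uint16_t ethertype;     /* 0x8808 (MAC Control)                   */
      uint16_t opcode;        /* 0x0001 = PAUSE                         */
      uint16_t pause_time;    /* in quanta of 512 bit times             */
      uint8_t  pad[42];       /* pad to the 60-byte minimum (pre-FCS)   */
  };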

Obviously, this coarse-grained approach pauses all traffic on the link - including any IP or FCoE traffic. Since that can hurt non-RoCE performance, Cisco proposed Priority Flow Control (PFC, now covered by IEEE 802.1Qbb). A PFC frame is a PAUSE frame with a special payload, indicating which Ethernet QoS classes should be paused. It is accompanied by other protocols (notably DCBX, the Data Center Bridging capability exchange) to negotiate QoS values on either end of a link (i.e., between NIC and switch).
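
PFC reuses the same MAC Control frame: the opcode changes, and the single timer becomes a class-enable bitmap plus one timer per priority. Again, a rough sketch of the 802.1Qbb layout:

  /* Rough on-the-wire layout of an 802.1Qbb PFC frame.  Same destination
   * MAC and Ethertype as a plain PAUSE, different opcode and payload. */
  #include <stdint.h>

  struct pfc_frame {
      uint8_t  dst[6];          /* 01:80:c2:00:00:01                     */
      uint8_t  src[6];          /* sender's MAC                          */
      uint16_t ethertype;       /* 0x8808 (MAC Control)                  */
      uint16_t opcode;          /* 0x0101 = priority-based flow control  */
      uint16_t class_enable;    /* bit N set => time[N] applies          */
      uint16_t time[8];         /* per-priority pause times, in quanta   */
      uint8_t  pad[26];         /* pad to the 60-byte minimum (pre-FCS)  */
  };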

Finally, the different types of traffic sharing the link are distinguished by their Ethertypes (assigned by the IEEE). IPv4, IPv6, FCoE, and RoCE each use a different value.
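
For the record, the registered values are below (the macro names are my own, loosely following Linux's if_ether.h; the values are the assigned ones):

  /* Ethertypes for the protocols mentioned above. */
  #define ETH_P_IPV4   0x0800   /* IPv4                                   */
  #define ETH_P_IPV6   0x86DD   /* IPv6                                   */
  #define ETH_P_FCOE   0x8906   /* Fibre Channel over Ethernet            */
  #define ETH_P_ROCE   0x8915   /* RoCE (Infiniband-over-Ethernet frames) */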

Reality

While RoCE is supported by stock OFED, as of OFED 1.5.3 it isn't completely stable. You'll want to use Mellanox's OFED distribution, version 1.5.3 or higher. Stock OFED will work fine for small tests, but large applications have a tendency to crash.

PFC is a pain. The tools to auto-negotiate may not exist for RoCE - the only documentation I've found was limited to FCoE. Avoid it if at all possible.

Somehow, you'll need to classify RoCE traffic as lossless. Here are some suggestions, in my order of preference:

  1. Discriminate RoCE traffic by Ethertype - RoCE packets would be treated losslessly, and non-RoCE traffic could be dropped (during congestion).
  2. Classify ALL traffic as lossless (and deal with the performance impact, if any, on non-RoCE traffic).
  3. Assign a QoS class for lossless traffic. Unfortunately, Mellanox adapters will only emit a QoS value when they emit a VLAN tag, so you'll need to do the following: run the RoCE traffic over a tagged VLAN interface, use that VLAN's (non-default) GIDs, and map an IB service level to the lossless QoS class (for MVAPICH2, via the def_prec2sl module parameter and /etc/mv2.conf; for OpenMPI, see below).

So you have Cisco Nexus switches...

If you can, stop reading and go buy some Infiniband adapters. You'll save a considerable amount of staff time.

Fine. Keep reading. But don't say I didn't warn you.

The Nexus 5000-series and Nexus 7000-series switches are completely different products. The interface for building lossless queues is different, the command syntax is different, and different values can be used for lossless traffic classes on each series of switch. If you have both in your environment, you'll be picking different QoS values for each.

The Nexus 7000 platform only supports lossless queuing on the newest "F-series" cards - the layer-2-only cards that have no routing abilities. You'll want to buy those if you plan on having stable RoCE.

Finally, be wary of ANY firmware updates. We had a functional RoCE configuration on a Nexus 7000 switch running firmware 5.1(3), using the third method above. That broke, however, when we upgraded to 5.1(5). Something changed in the default queuing configuration, and since you can only build on the default lossless queue config (rather than nuke it and define your own), you are subject to changes in that default. In our case, RoCE performance dropped from 9.91 Gbps to 30 Mbps. All wasn't lost, though - after the upgrade, all traffic was treated as lossless (except what we'd previously tagged via QoS, of course). We simply stopped using QoS, and now have reliable Ethernet. Absolutely bizarre.

Making this all work for practical apps

Making this work depends on how RoCE traffic was classified. If RoCE Ethertypes are lossless, or if all traffic is lossless (options #1 or #2 above), any RDMA application should just work - the RoCE adapter presents itself as an Infiniband HCA.
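
One quick way to confirm this from a compute node is to query the adapter through the verbs API (a minimal sketch; error handling is mostly trimmed, and it assumes a libibverbs recent enough to report the port's link layer):

  /* List RDMA devices and report the link layer of port 1 on each.  A RoCE
   * adapter shows up just like an Infiniband HCA, but with link_layer set
   * to IBV_LINK_LAYER_ETHERNET.  Build with -libverbs. */
  #include <stdio.h>
  #include <infiniband/verbs.h>

  int main(void)
  {
      int num = 0;
      struct ibv_device **devs = ibv_get_device_list(&num);

      if (!devs)
          return 1;

      for (int i = 0; i < num; i++) {
          struct ibv_context *ctx = ibv_open_device(devs[i]);
          struct ibv_port_attr port;

          if (!ctx)
              continue;

          if (ibv_query_port(ctx, 1, &port) == 0)
              printf("%s: port 1 link layer = %s\n",
                     ibv_get_device_name(devs[i]),
                     port.link_layer == IBV_LINK_LAYER_ETHERNET ?
                         "Ethernet" : "Infiniband");

          ibv_close_device(ctx);
      }

      ibv_free_device_list(devs);
      return 0;
  }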

If you picked option #3, you'll need to jump through some extra hoops. First, set the def_prec2sl module parameter and /etc/mv2.conf as described above; at that point, MVAPICH2 applications should work. For OpenMPI, you'll need version 1.4.4 or 1.5.4 (or newer), plus additional command-line options to set the IB service level and the IP address to use: -mca btl_openib_ib_service_level <number> and -mca btl_openib_ipaddr_include <ipaddr>, respectively. These can be baked into a config file (like openmpi-mca-params.conf in your OpenMPI's share directory). Note that btl_openib_ipaddr_include can take CIDR notation for a subnet to match, so you can use the same config file on all nodes in a cluster.

In theory, it may be possible to use RoCE for non-MPI applications - including kernel-level consumers like Lustre. I'd only attempt this if option #1 or #2 is in use, though - the extra VLANs, non-default GIDs, and custom IB service levels (mapped to Ethernet QoS classes) are likely to be hard to integrate into anything other than OpenMPI and MVAPICH2.

Additional Resources

There isn't a lot of documentation (practically zero, outside of Mellanox) on RoCE. Any useful links I can find will be added here.


Last modified: 2012/01/02