Bandwidth Delay Product Tuning & You


Wide-area networks abound, and fast networks abound (where fast is > 200 Mbps),
but your average consumer will never deal with both at the same time. As of
this writing (late 2013), typical US broadband connections are 40 Mbps or less.
Generally less. Most operating systems seem to be tuned to work acceptably
well across the Internet at these speeds by default. Unfortunately, users of
faster links (like gigabit ethernet, or 10 or 40 Gbps ethernet) are often left
at a loss to explain why their network connections seem amazingly fast on a
local connection (intra-building, across a small academic campus, etc.) but
fall to rather paltry speeds when covering any sort of distance. In my
experience, users generally chalk this up to “the network is slow”, and live
with it. If network support engineers (ISP, corporate network group,
whatever) are engaged, you usually get some sort of finger-pointing; every side
has plenty of evidence that the client, the server, and the network are all
operating just fine, thank you, and that something else must be broken.


In many cases, TCP itself is the limiting factor. TCP must deliver data
reliably even when the underlying network drops, reorders, or corrupts
packets. To support
that, a TCP implementation (read: your operating system kernel) must save
every byte of data it transmits until the recipient has explicitly acknowledged
it. If a packet is lost, the recipient will fail to acknowledge (ACK) it (or
will repeatedly ACK the last byte it did receive); the sender can use its
stored copy to re-transmit the missing data. So how big does this buffer
need to be, anyway? Yeah, that would be the bandwidth delay product
(http://en.wikipedia.org/wiki/Bandwidth-delay_product).


Bits do not propagate instantly - the speed of light is finite. That means
a sender must buffer enough data to keep its network adapter running at full
speed for the entire round-trip delay to the recipient. The round-trip
delay can be measured via the UNIX ping command; typical values are
in the tens of milliseconds. Multiply the bandwidth by the round-trip time (in
seconds), and you’ve got the amount of buffer space needed to keep a
connection busy at that distance. For example, a 1 Gbps network connection
with a 54 ms ping latency (say, from the midwest to the west coast) requires
1 Gb/s * 0.054 s = 54 Mb = 6.75 MB of buffer space. Obviously, a
10 Gbps ethernet connection (and appropriate routers) would require 540 Mb =
67.5 MB of buffer to fill the available bandwidth.
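

If you’d rather not do the arithmetic by hand, a quick back-of-the-envelope
check looks something like this (the hostname is a placeholder, and the
1 Gbps / 54 ms figures are assumptions to replace with your own link speed and
measured round-trip time):

# Measure the round-trip time to the far end (substitute your own host)
ping -c 5 far-end.example.org

# Bandwidth delay product: link speed (bits/s) times round-trip time (seconds)
awk 'BEGIN { bw = 1e9; rtt = 0.054; bdp = bw * rtt;
             printf "BDP: %.0f Mb = %.2f MB of buffer\n", bdp / 1e6, bdp / 8 / 1e6 }'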


The remainder of this document outlines how to tune TCP stacks in a couple OSes
for high bandwidth delay product communication. There’s a wide array of
OS-specific TCP and IP tuning parameters; here, I’m only focusing on the ones
related to long-haul TCP sessions. For more info, check out the links
referenced below.

Linux


Linux’s TCP stack includes tunables for overall maximum socket memory, as well
as a three-part value for send and receive, listing minimum, initial, and
maximum memory use. There are many other tunables, but as of RedHat Enterprise
6 (kernel 2.6.32 or so) most of these default to usable values for a 10 Gbps
WAN connection. The socket memory settings, however, default to a maximum of
4 MB of buffer space - probably far too small for modern WAN transfers.


TCP tunables are controlled via sysctl (read the man page). Add
the following to /etc/sysctl.conf:

net.core.rmem_max = 524288000
net.ipv4.tcp_rmem = 8192 262144 131072000
net.core.wmem_max = 524288000
net.ipv4.tcp_wmem = 8192 262144 131072000


The rmem_max line allows up to 0.5 GB of memory to be used for a
socket. Technically, this is way overkill, as the next line (for
tcp_rmem) will limit this to roughly 128 MB max (and 8 kB minimum, with
a default of 256 kB). If 128 MB proves insufficient, simply raise this third
value. The same two settings are repeated for wmem (memory for a sending socket).
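

To pick up the new settings without a reboot, reload the file and double-check
the values (standard sysctl usage; run the reload as root or via sudo):

# Re-read /etc/sysctl.conf
sudo sysctl -p
# Confirm the new limits
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem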

Apple OS X


Apple’s TCP stack is BSD-derived. It also uses sysctl for tuning,
but has different tunables from Linux. Total socket memory is limited by
the maxsockbuf parameter; unfortunately, as of OS 10.9, this is
limited to a mere 6 MB - and that must be split (statically!) between send
and receive memory. Honestly, that’s just not enough for long-distance
transfers, but we’ll make the most of what we have.


Currently, I’m recommending these lines in /etc/sysctl.conf (a note on
applying them without a reboot follows the list below):

kern.ipc.maxsockbuf=6291456
net.inet.tcp.sendspace=3145728
net.inet.tcp.recvspace=3145728
net.inet.tcp.doautorcvbuf=0
net.inet.tcp.doautosndbuf=0
net.inet.tcp.mssdflt=1460
net.inet.tcp.win_scale_factor=8
net.inet.tcp.delayed_ack=0

  • kern.ipc.maxsockbuf: This is the maximum amount of memory to
    use for a socket, including both read and write memory. Again, in 10.9, this is limited
    to 6 MB (and defaults to 6 MB) - rather disappointing, Apple. Note that this
    probably also affects SYSV IPC sockets (though, that’s unlikely to make a
    major difference for anyone).

  • net.inet.tcp.sendspace: Allow for up to 3 MB of memory for a send buffer.
    This, plus net.inet.tcp.recvspace, must be less than maxsockbuf.

  • net.inet.tcp.recvspace: Allow for up to 3 MB of memory for a receive buffer.
    This, plus net.inet.tcp.sendspace, must be less than maxsockbuf.

  • net.inet.tcp.doautorcvbuf,doautosndbuf: MacOS has a mechanism
    for auto-tuning buffer sizes. By default, this is limited to 512 kB for each
    of send and receive. Setting these to 0 will disable the buffer
    auto-tuning entirely.

  • net.inet.tcp.autorcvbufmax,autosndbufmax: If you’d rather
    keep the auto-tuning buffer logic enabled (see above), you’ll want to raise
    this maximum. The default (at least in 10.9) is 512 KB; a value of 3 MB
    (3145728) is more appropriate, and will allow your machine to hit higher
    transfer speeds. I suggest going this route (auto-tuning with a larger
    maximum, rather than the static buffers above) if your machine handles a lot
    of TCP connections. Most users probably won’t care, but at up to 6 MB per TCP
    connection, static buffers could burn through memory quickly if you’ve got
    hundreds of connections in progress.

  • net.inet.tcp.autorcvbufinc,autosndbufinc: Based on the name,
    I suspect this determines how aggressively buffer auto-tuning ramps up to its
    full buffer size. It defaults to 8 KB; if you do use buffer auto-tuning, and
    if you see poor performance on short-lived connections (but better performance
    on TCP transfers that take at least a couple minutes to complete), you might
    try increasing this value by a factor of 10-20.

  • net.inet.tcp.mssdflt: Yeah, this should be higher. MacOS
    defaults to 512 bytes for its maximum segment size (the largest chunk of data
    it will put in a single packet). “Normal” ethernet frames are up to 1500 bytes (and there
    are specs for yet larger packets). 512 bytes is appropriate for modems, but
    not for anything faster (and that includes cable modems). If you’re using
    ethernet, I’d recommend 1460 (that’s a 1500-byte ethernet frame, minus 40 bytes
    of TCP/IP headers). If your ethernet goes through a PPPoE relay (e.g., DSL,
    and maybe some cable modems) you probably want something a bit lower, like
    1440, to leave room for the PPPoE framing overhead. Note that this doesn’t
    really make your connection
    faster - you just use fewer packets (and therefore fewer network resources)
    to get the job done.

  • net.inet.tcp.win_scale_factor: Most TCP implementations
    automatically calculate the window scale factor. In case MacOS doesn’t, I
    set this to 8 - though I’m not certain this is required. Try it, try omitting
    it, see if there’s any difference. If you’re wondering what a window scale
    factor is, I suggest reading the wikipedia page
    (http://en.wikipedia.org/wiki/Bandwidth-delay_product). Essentially, it
    controls how large a buffer your machine can advertise to the other side of
    the TCP connection.

  • net.inet.tcp.delayed_ack: Delayed ACKs are generally a good
    idea - wait until a few packets have arrived, and acknowledge them all at once.
    Fewer reply packets, less network traffic, etc. This can result in slightly
    higher latency (since the receiver waits slightly for multiple packets to
    arrive, even if only one is on the wire). Worse still, in some not-so-rare
    circumstances, this can interact very badly with Nagle’s algorithm
    (a similar sender-side optimization) - so much so that you can get several
    orders of magnitude worse performance, with no obvious reason why. If you
    suspect this is a problem, turn it off.
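
As for applying the OS X settings: /etc/sysctl.conf is normally read at boot,
but you can also set the values by hand to experiment without rebooting (a
sketch using the same values as above; changes made this way revert at the
next reboot unless they’re also in /etc/sysctl.conf):

sudo sysctl -w kern.ipc.maxsockbuf=6291456
sudo sysctl -w net.inet.tcp.sendspace=3145728
sudo sysctl -w net.inet.tcp.recvspace=3145728
sudo sysctl -w net.inet.tcp.mssdflt=1460
# Confirm the current values
sysctl kern.ipc.maxsockbuf net.inet.tcp.sendspace net.inet.tcp.recvspace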

References, Next Steps


There is a plethora of TCP tuning guides out there. If you’re tuning for a
specific application, or with certain high-end hardware (in particular,
Mellanox 10 and 40 Gbps adapters), I’d recommend looking at ethtool
settings as well.
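
For example, to see what an adapter currently offers for tuning (eth0 is a
placeholder for your interface name):

# Ring buffer sizes (current and maximum)
ethtool -g eth0
# Offload features (checksums, segmentation offload, etc.)
ethtool -k eth0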