Bandwidth Delay Product Tuning & You


Wide-area networks abound, and fast networks abound (where fast is > 200 Mbps),
but your average consumer will never deal with both at the same time. As of
this writing (late 2013), typical US broadband connections are 40 Mbps or less.
Generally less. Most operating systems seem to be tuned to work acceptably
well across the Internet at these speeds by default. Unfortunately, users of
faster links (like gigabit ethernet, or 10 or 40 Gbps ethernet) are often left
at a loss to explain why their network connections seem amazingly fast on a
local connection (intra-building, across a small academic campus, etc.) but
fall to rather paltry speeds when covering any sort of distance. In my
experience, users generally chalk this up to “the network is slow”, and live
with it. If network support engineers (ISP, corporate network group,
whatever) are engaged, you usually get some sort of finger-pointing; every side
has plenty of evidence that the client, the server, and the network are all
operating just fine, thank you, and that something else must be broken.


In many cases, TCP itself is the limiting factor. TCP must deliver data
reliably even when the underlying network drops, reorders, or corrupts
packets. To support
that, a TCP implementation (read: your operating system kernel) must save
every byte of data it transmits until the recipient has explicitly acknowledged
it. If a packet is lost, the recipient will fail to acknowledge (ACK) it (or
will repeatedly ACK the last byte it did receive); the sender can use its
stored copy to re-transmit the missing data. So how big does this buffer
need to be, anyway? Yeah, that would be the bandwidth delay product
(http://en.wikipedia.org/wiki/Bandwidth-delay_product).


Bits do not propagate instantly - the speed of light is finite. That means
a sender must buffer enough data to keep its network adapter running at full
speed for the entire round-trip delay to the recipient. The round-trip
delay can be measured via the UNIX ping command; typical values are
in the tens of milliseconds. Multiply the bandwidth by the round-trip time (in
seconds), and you’ve got the amount of buffer space needed to keep a
connection busy at that distance. For example, a 1 Gbps network connection
with a 54 ms ping latency (say, from the midwest to the west coast) requires
1 Gb/s * 0.054 s = 54 Mb = 6.75 MB of buffer space. Obviously, a
10 Gbps ethernet connection (and appropriate routers) would require 540 Mb =
67.5 MB of buffer to fill the available bandwidth.
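

If you’d rather not do the arithmetic by hand, a quick back-of-the-envelope
check looks something like this (the hostname is a placeholder, and the
1 Gbps / 54 ms figures are assumptions to replace with your own link speed and
measured round-trip time):

# Measure the round-trip time to the far end (substitute your own host)
ping -c 5 far-end.example.org

# Bandwidth delay product: link speed (bits/s) times round-trip time (seconds)
awk 'BEGIN { bw = 1e9; rtt = 0.054; bdp = bw * rtt;
             printf "BDP: %.0f Mb = %.2f MB of buffer\n", bdp / 1e6, bdp / 8 / 1e6 }'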


The remainder of this document outlines how to tune TCP stacks in a couple OSes
for high bandwidth delay product communication. There’s a wide array of
OS-specific TCP and IP tuning parameters; here, I’m only focusing on the ones
related to long-haul TCP sessions. For more info, check out the links
referenced below.

Linux


Linux’s TCP stack includes tunables for overall maximum socket memory, as well
as a three-part value for send and receive, listing minimum, initial, and
maximum memory use. There are many other tunables, but as of RedHat Enterprise
6 (kernel 2.6.32 or so) most of these default to usable values for a 10 Gbps
WAN connection. The socket memory settings, however, default to a maximum of
4 MB of buffer space - probably far too small for modern WAN transfers.


TCP tunables are controlled via sysctl (read the man page). Add
the following to /etc/sysctl.conf:

net.core.rmem_max = 524288000
net.ipv4.tcp_rmem = 8192 262144 131072000
net.core.wmem_max = 524288000
net.ipv4.tcp_wmem = 8192 262144 131072000


The rmem_max line allows up to 0.5 GB of memory to be used for a
socket. Technically, this is way overkill, as the next line (for
tcp_rmem) will limit this to roughly 128 MB max (and 8 kB minimum, with
a default of 256 kB). If 128 MB proves insufficient, simply raise this third
value. The same two settings are repeated for wmem (memory for a sending socket).
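

To pick up the new settings without a reboot, reload the file and double-check
the values (standard sysctl usage; run the reload as root or via sudo):

# Re-read /etc/sysctl.conf
sudo sysctl -p
# Confirm the new limits
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem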

Apple OS X


Apple’s TCP stack is BSD-derived. It also uses sysctl for tuning,
but has different tunables from Linux. Total socket memory is limited by
the maxsockbuf parameter; unfortunately, as of OS 10.9, this is
limited to a mere 6 MB - and that must be split (statically!) between send
and receive memory. Honestly, that’s just not enough for long-distance
transfers, but we’ll make the most of what we have.


Currently, I’m recommending these lines in /etc/sysctl.conf (a note on
applying them without a reboot follows the list below):

kern.ipc.maxsockbuf=6291456
net.inet.tcp.sendspace=3145728
net.inet.tcp.recvspace=3145728
net.inet.tcp.doautorcvbuf=0
net.inet.tcp.doautosndbuf=0
net.inet.tcp.mssdflt=1460
net.inet.tcp.win_scale_factor=8
net.inet.tcp.delayed_ack=0

  • kern.ipc.maxsockbuf: This is the maximum amount of memory to
    use for a socket, including both read and write memory. Again, in 10.9, this is limited
    to 6 MB (and defaults to 6 MB) - rather disappointing, Apple. Note that this
    probably also affects SYSV IPC sockets (though, that’s unlikely to make a
    major difference for anyone).

  • net.inet.tcp.sendspace: Allow for up to 3 MB of memory for a send buffer.
    This, plus net.inet.tcp.recvspace, must be less than maxsockbuf.

  • net.inet.tcp.recvspace: Allow for up to 3 MB of memory for a receive buffer.
    This, plus net.inet.tcp.sendspace, must be less than maxsockbuf.

  • net.inet.tcp.doautorcvbuf,doautosndbuf: MacOS has a mechanism
    for auto-tuning buffer sizes. By default, this is limited to 512 kB for each
    of send and receive. Setting these to 0 will disable the buffer
    auto-tuning entirely.

  • net.inet.tcp.autorcvbufmax,autosndbufmax: If you’d rather
    keep the auto-tuning buffer logic enabled (see above), you’ll want to raise
    this maximum. The default (at least in 10.9) is 512 KB; a value of 3 MB
    (3145728) is more appropriate, and will allow your machine to hit higher
    transfer speeds. I suggest going this route (auto-tuning with a larger
    maximum, rather than the static buffers above) if your machine handles a lot
    of TCP connections. Most users probably won’t care, but at up to 6 MB per TCP
    connection, static buffers could burn through memory quickly if you’ve got
    hundreds of connections in progress.

  • net.inet.tcp.autorcvbufinc,autosndbufinc: Based on the name,
    I suspect this determines how aggressively buffer auto-tuning ramps up to its
    full buffer size. It defaults to 8 KB; if you do use buffer auto-tuning, and
    if you see poor performance on short-lived connections (but better performance
    on TCP transfers that take at least a couple minutes to complete), you might
    try increasing this value by a factor of 10-20.

  • net.inet.tcp.mssdflt: Yeah, this should be higher. MacOS
    defaults to 512 bytes for its maximum segment size (the largest chunk of data
    it will put in a single packet). “Normal” ethernet frames are up to 1500 bytes (and there
    are specs for yet larger packets). 512 bytes is appropriate for modems, but
    not for anything faster (and that includes cable modems). If you’re using
    ethernet, I’d recommend 1460 (that’s a 1500-byte ethernet frame, minus 40 bytes
    of TCP/IP headers). If your ethernet goes through a PPPoE relay (e.g., DSL,
    and maybe some cable modems) you probably want something a bit lower, like
    1440, to leave room for the PPPoE framing overhead. Note that this doesn’t
    really make your connection
    faster - you just use fewer packets (and therefore fewer network resources)
    to get the job done.

  • net.inet.tcp.win_scale_factor: Most TCP implementations
    automatically calculate the window scale factor. In case MacOS doesn’t, I
    set this to 8 - though I’m not certain this is required. Try it, try omitting
    it, see if there’s any difference. If you’re wondering what a window scale
    factor is, I suggest reading the wikipedia page
    (http://en.wikipedia.org/wiki/Bandwidth-delay_product). Essentially, it
    controls how large a buffer your machine can advertise to the other side of
    the TCP connection.

  • net.inet.tcp.delayed_ack: Delayed ACKs are generally a good
    idea - wait until a few packets have arrived, and acknowledge them all at once.
    Fewer reply packets, less network traffic, etc. This can result in slightly
    higher latency (since the receiver waits slightly for multiple packets to
    arrive, even if only one is on the wire). Worse still, in some not-so-rare
    circumstances, this can interact very badly with Nagle’s algorithm
    (a similar sender-side optimization) - so much so that you can get several
    orders of magnitude worse performance, with no obvious reason why. If you
    suspect this is a problem, turn it off.
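
As for applying the OS X settings: /etc/sysctl.conf is normally read at boot,
but you can also set the values by hand to experiment without rebooting (a
sketch using the same values as above; changes made this way revert at the
next reboot unless they’re also in /etc/sysctl.conf):

sudo sysctl -w kern.ipc.maxsockbuf=6291456
sudo sysctl -w net.inet.tcp.sendspace=3145728
sudo sysctl -w net.inet.tcp.recvspace=3145728
sudo sysctl -w net.inet.tcp.mssdflt=1460
# Confirm the current values
sysctl kern.ipc.maxsockbuf net.inet.tcp.sendspace net.inet.tcp.recvspace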

References, Next Steps


There is a plethora of TCP tuning guides out there. If you’re tuning for a
specific application, or with certain high-end hardware (in particular,
Mellanox 10 and 40 Gbps adapters), I’d recommend looking at ethtool
settings as well.
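
For example, to see what an adapter currently offers for tuning (eth0 is a
placeholder for your interface name):

# Ring buffer sizes (current and maximum)
ethtool -g eth0
# Offload features (checksums, segmentation offload, etc.)
ethtool -k eth0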