Advanced Linux network performance tuning

2014-07-03

Marek Obuchowicz

Preface

While supporting customer traffic infrastructure, we found a lot of room for improvement in the area of Linux networking. How did it all start? The traffic nodes in several datacenters were serving a vast amount of network traffic. However, we found that we could reach only up to 30 percent of the maximum network throughput. At that time, none of the typical metrics (CPU, disk throughput, memory) seemed to be the bottleneck. In such a situation, the usual suspect is the software itself. But let’s not forget about the operating system, which can be tuned a lot. This is especially true for Linux: the default settings work well for typical cases, but there is much room for improvement.

Do I need to tune my network settings?

Usually not. Recent Linux kernels and distributions provide carefully selected defaults, which are tuned to typical use cases and perform really well. However, not all servers run in typical environments. Sometimes the default values need to be reviewed and adjusted to get more from your resources.

The improvements described below might help if your machine is under high network pressure. What does that mean? Traffic of a few hundred Mbit/s and higher. This usually happens on machines with roles such as firewall, load balancer or busy web server. For database and application servers (like web servers spending most of their time running your PHP/Java/Ruby/Python/… code), the typical bottlenecks are CPU and I/O, so there’s no real need to tune networking there, as the default values perform just fine. One useful change might be making sure that the listen backlog is big enough for web/application servers (controlled by the sysctl net.core.somaxconn, see below).

Firewall with iptables

In most cases, I would recommend running services on servers in an isolated network, meaning that they should not be directly accessible from the Internet. For security and performance reasons, it’s most common to keep just firewalls (doing NAT), VPN gateways and load balancers open to the Internet. But if that’s not the case and our machines have public IP addresses, a firewall is a must-have. iptables is a great tool, but it is pretty easy to make performance mistakes with it. When dealing with a Linux firewall, consider the following points:

Place the most commonly used rules at the beginning of the ruleset.

It is pretty easy to figure out how many times each rule has been triggered. Simply execute “iptables -L -n -v” on a running machine and take a look at the counters in the first column: they indicate how many packets have matched each rule. If possible (remember that changing the order of rules might affect the filtering logic!), try to move the rules with the highest counter values to the beginning of the chain:

# iptables -L -n -v
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)

 pkts bytes target     prot opt in     out     source               destination        
23370 4120K ACCEPT     all  --  lo     *       0.0.0.0/0            0.0.0.0/0
 9568  536K ACCEPT     icmp --  *      *       0.0.0.0/0            0.0.0.0/0
 103K  90M  ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0 state RELATED,ESTABLISHED
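
Spotting the busiest rules by eye gets tedious on long chains. The following is a minimal sketch, assuming a POSIX shell with awk and sort; rank_rules is a hypothetical helper name, not part of iptables. It expects the listing to be produced with -x (exact counters instead of rounded values such as 103K) and --line-numbers, so that each rule line starts with its position in the chain:

```shell
#!/bin/sh
# rank_rules: hypothetical helper; reads an iptables listing on stdin and
# prints the rule lines sorted by packet counter, busiest rule first.
# Expected input: iptables -L <chain> -n -v -x --line-numbers
rank_rules() {
    # Keep only rule lines (first field is the rule number), then sort
    # numerically, in descending order, on the second field (pkts).
    awk '$1 ~ /^[0-9]+$/' | sort -k2,2nr
}

# Usage (as root):
#   iptables -L INPUT -n -v -x --line-numbers | rank_rules | head -n 5
```

The first column of the result is the current position of each rule, which shows at a glance whether a heavily used rule sits far down the chain.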

Try to make it simple.

Avoid mangling, redirects, translations and other fancy features if you don’t really need them. A pretty common case for using redirects is running services as non-root while needing to bind them to privileged ports (<1024). You can bind the service to a higher port number and use a redirect to pass along the packets arriving at the privileged port. But there’s a simpler way to do that: you can use Linux kernel “capabilities”. The “setcap” command can grant specific binaries the capability of binding to privileged ports as non-root. For example, to allow the nginx server to do that, use the following command:

# setcap cap_net_bind_service+ep /usr/bin/nginx

Using this feature, we can, for example, allow our web application server, running as an ordinary user, to bind to ports 80 and 443.
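
To check whether the capability is actually in place, getcap (shipped alongside setcap in the libcap tools) prints the file capabilities of a binary. A minimal sketch follows; has_bind_cap is a hypothetical helper name, and the nginx path is simply the one from the command above. Keep in mind that file capabilities are stored on the file itself, so they are typically lost when the binary is replaced, for example by a package upgrade.

```shell
#!/bin/sh
# has_bind_cap: hypothetical helper; succeeds if getcap reports the
# cap_net_bind_service file capability for the given binary.
has_bind_cap() {
    getcap "$1" | grep -q 'cap_net_bind_service'
}

# Usage:
#   has_bind_cap /usr/bin/nginx && echo "nginx may bind to ports below 1024"
```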

Consider disabling connection tracking

The nf_conntrack kernel module is required for features like NAT and stateful connection tracking, but it’s not required for basic TCP filtering. You can replace the rule which allows packets in the ESTABLISHED state with a simple check of TCP flags. The first packet sent by a client trying to establish a TCP session has the SYN flag set and the ACK flag cleared. No later packet the client sends in that session has this combination of flags. Therefore, we can allow all TCP packets that are not initial SYN packets to pass. This has a slightly different effect than stateful filtering: it will let through a stray or malformed packet that doesn’t belong to any existing TCP session. But that’s fine: those packets won’t do any harm, as the kernel finds no matching connection for them and discards them.
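
A stateless ruleset along these lines could look like the sketch below. It is an illustration under assumptions, not a complete firewall: the open ports (22, 80 and 443) are examples, UDP and ICMP would need rules of their own, and IPT is a hypothetical variable that defaults to a dry run which only prints the commands. The --syn match is shorthand for “SYN set, with ACK, RST and FIN cleared”. Connection tracking only stays out of the picture if no other rule uses the state/conntrack matches or NAT.

```shell
#!/bin/sh
# Dry run by default: print the commands instead of executing them.
# Run with IPT=iptables (as root) to apply the rules for real.
IPT="${IPT:-echo iptables}"

stateless_tcp_rules() {
    # Local traffic is always allowed.
    $IPT -A INPUT -i lo -j ACCEPT
    # Any TCP packet that is not an initial SYN; this takes the place of
    # the "state RELATED,ESTABLISHED" rule.
    $IPT -A INPUT -p tcp ! --syn -j ACCEPT
    # New connections are accepted only on the published ports (examples).
    for port in 22 80 443; do
        $IPT -A INPUT -p tcp --syn --dport "$port" -j ACCEPT
    done
    # Everything else is dropped by the chain policy.
    $IPT -P INPUT DROP
}

stateless_tcp_rules
```

Reviewing the printed commands before applying them is a cheap safeguard against locking yourself out of a remote machine.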

Sysctl kernel tunables

There are thousands of ready-made “recipes” for network sysctl tuning on the Internet; just search for them and you will get a bunch of results. But remember: one of the worst things you can do is to tune performance parameters “blindly”, without understanding what they do and without measuring the impact of the change on performance. The chances of doing “good” are equal to the chances of making things worse! I’ve seen many examples that provided totally wrong values as an “ultimate” fix. Never trust them. The successful way to tune performance parameters, sysctls among them, is:

Understanding the impact of changing values: So, RTFM… Before you change any value, take the time to read the documentation. For most settings, you probably won’t need to touch them at all. The key is that you understand what you are doing and what impact the change will have on the running system.
Benchmarking: Never change a tunable value without checking the results. Make sure that you know your system’s performance before and after the change. Change only one parameter, or one consistent group of parameters, at a time. Compare the results. Did anything change? Is it better or worse? It is a good idea to stress test the system, generating as much traffic as you need to hit the bottleneck. Measuring an idle system usually shows no difference in results.
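
To make before/after comparisons reproducible, it helps to record the current values before touching anything. The sketch below assumes a Linux system where sysctls are exposed under /proc/sys; sysctl_path and snapshot_sysctls are hypothetical helper names, and the keys in the usage line are just examples. The output uses the “key = value” form that sysctl.conf accepts, so a snapshot can double as a rollback file.

```shell
#!/bin/sh
# sysctl_path: map a sysctl key to its file under /proc/sys
# (dots become slashes; keys that contain interface names with dots
# are not handled by this simple sketch).
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# snapshot_sysctls: print "key = value" for each key given as argument.
snapshot_sysctls() {
    for key in "$@"; do
        path=$(sysctl_path "$key")
        if [ -r "$path" ]; then
            # Multi-value sysctls are tab-separated in /proc; use spaces.
            printf '%s = %s\n' "$key" "$(tr '\t' ' ' < "$path")"
        else
            printf '# %s is not available on this system\n' "$key"
        fi
    done
}

# Usage:
#   snapshot_sysctls net.core.somaxconn net.ipv4.tcp_rmem > before.conf
```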

As I mentioned above, it’s usually better to spend some time figuring out what really works for your system than to apply recipes which might work for others, but not necessarily for you. However, I think it might be helpful to list the sysctls that usually have an impact on network performance:

net.nf_conntrack_max: maximum number of tracked connections in iptables
net.ipv4.ip_local_port_range: range of ports used for outgoing TCP connections (useful to change if you have a lot of outgoing connections from the host)
net.ipv4.tcp_rmem: autotuning tcp receive buffer parameters
net.ipv4.tcp_wmem: autotuning tcp send buffer parameters
net.core.rmem_max: maximum socket receive buffer size (in bytes) that an application can request with SO_RCVBUF
net.core.wmem_max: maximum socket send buffer size (in bytes) that an application can request with SO_SNDBUF
net.ipv4.tcp_mem: thresholds for total TCP buffer memory usage, in memory pages (1 page is typically 4 KB)
net.core.somaxconn: maximum listen queue size for sockets. This is a useful and often overlooked setting for load balancers, web servers and application servers (like unicorn, php-fpm). If all server processes/threads are busy, incoming client connections are put in the “backlog”, waiting to be served. A full backlog causes new client connections to be dropped or rejected, resulting in client errors.
net.core.dev_weight: maximum number of frames that the kernel may drain from a device queue during one interrupt cycle (a higher value allows more packets to be processed during one softirq cycle, but causes longer gaps in CPU availability for applications)
net.core.netdev_max_backlog: maximum number of received packets that may wait in the kernel’s input queue (the kernel puts packets in this queue when the interface receives them faster than the CPU can process them, for example because it is busy running applications)
net.ipv4.tcp_keepalive_time: interval in seconds between the last data packet sent and the first keepalive probe sent by Linux
net.ipv4.tcp_keepalive_probes: number of unacknowledged keepalive probe packets to send before considering the connection dead
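
Whether a setting such as net.core.somaxconn matters on a given machine can be checked from the kernel’s own counters rather than guessed. The sketch below assumes the TcpExt counters in /proc/net/netstat (pairs of lines: one with names, one with values); tcpext_counter is a hypothetical helper name. A ListenOverflows value that keeps growing under load suggests that a listen backlog is overflowing.

```shell
#!/bin/sh
# tcpext_counter: hypothetical helper; reads /proc/net/netstat-style input
# on stdin and prints the value of one TcpExt counter.
tcpext_counter() {
    awk -v name="$1" '
        # First TcpExt line holds the column names: find the wanted one.
        $1 == "TcpExt:" && !seen {
            for (i = 2; i <= NF; i++) if ($i == name) col = i
            seen = 1
            next
        }
        # Second TcpExt line holds the values in the same column order.
        $1 == "TcpExt:" && seen {
            if (col) print $col
            exit
        }
    '
}

# Usage:
#   tcpext_counter ListenOverflows < /proc/net/netstat
```

Note that net.core.somaxconn is only an upper limit: the application also has to request a large enough backlog in its listen() call.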

MTU

MTU (maximum transmission unit) is the maximum frame size allowed to be sent over an interface. The largest working value depends on the specific link type we use: it’s different for Ethernet, DSL, InfiniBand etc. The maximum value that should be used on the Internet is 1500, and this is the default Linux setting for Ethernet interfaces.

The bigger a single frame is, the fewer frames (and packets) the server needs to send to transfer a given amount of data. As each packet carries some overhead data and requires processing, it’s beneficial to send as few packets as possible. If a packet is bigger than the MTU of a link along the path, it will be either fragmented into smaller packets or not transmitted at all. Both scenarios require additional action by network devices and, in effect, degrade performance. The default value (1500) is optimal for all Internet-facing interfaces. However, if you are setting up private networks, which is pretty common for server farms, you are using only Ethernet as your link layer, which means you can use larger frames. So-called “jumbo frames”, typically with a size of 9000 bytes, are supported by most gigabit/10-gigabit switches and network interfaces nowadays. Switching the MTU to 9000 works on most private networks and offers a pretty big performance gain. Just make sure that:

you enable jumbo frames only on the private network, not on public interfaces connected to the Internet
jumbo frames are enabled on all servers and switches within your private network
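
To put a number on the per-packet overhead: with plain TCP over IPv4, each packet carries 40 bytes of IP and TCP headers, and each Ethernet frame adds 38 bytes on the wire (14-byte header, 4-byte checksum, 8-byte preamble and 12-byte inter-frame gap). The sketch below turns this into a payload efficiency figure; it assumes no TCP options and no VLAN tag, and wire_efficiency is a hypothetical helper name.

```shell
#!/bin/sh
# wire_efficiency: percentage of on-wire bytes that is TCP payload
# for a given MTU (IPv4 + TCP without options, untagged Ethernet).
wire_efficiency() {
    awk -v mtu="$1" 'BEGIN {
        payload = mtu - 20 - 20          # minus IPv4 and TCP headers
        on_wire = mtu + 14 + 4 + 8 + 12  # plus Ethernet framing overhead
        printf "%.1f\n", 100 * payload / on_wire
    }'
}

wire_efficiency 1500   # prints 94.9
wire_efficiency 9000   # prints 99.1
```

Besides the few percent of bandwidth, the number of packets to process for the same amount of data drops by a factor of about six.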

To enable jumbo frames, use the following command (replace eth0 with the correct interface name):

# ifconfig eth0 mtu 9000

To make the change persistent across reboots on Debian/Ubuntu, add the line:

post-up ifconfig $IFACE mtu 9000

to the corresponding interface configuration in /etc/network/interfaces.
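
On systems where ifconfig is not installed, the iproute2 equivalent is “ip link set dev eth0 mtu 9000”. Either way, it is worth verifying that large frames really make it across the private network. The sketch below assumes the iputils version of ping, where “-M do” forbids fragmentation; ping_payload_for_mtu and check_path_mtu are hypothetical helper names, and 10.0.0.2 stands for another host on your private network.

```shell
#!/bin/sh
# ping_payload_for_mtu: largest ICMP echo payload that fits into a single
# IPv4 packet of the given MTU (20-byte IPv4 header + 8-byte ICMP header).
ping_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# check_path_mtu HOST MTU: send three unfragmentable probes of full size.
# Lost probes or "message too long" errors point to a smaller MTU
# somewhere on the path.
check_path_mtu() {
    ping -M do -c 3 -s "$(ping_payload_for_mtu "$2")" "$1"
}

# Usage:
#   check_path_mtu 10.0.0.2 9000
```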

Network interface settings

To check or modify the physical settings of your Ethernet interface, you need a tool called “ethtool” (available as a package in most distributions, but often not installed by default). You can use ethtool to check your interface settings:

# ethtool eth0
Settings for eth0:
         Supported ports: [ TP ]
         Supported link modes:   10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Full
         Supported pause frame use: No
         Supports auto-negotiation: Yes
         Advertised link modes:  10baseT/Half 10baseT/Full
                                 100baseT/Half 100baseT/Full
                                 1000baseT/Full
         Advertised pause frame use: No
         Advertised auto-negotiation: Yes
         Speed: 1000Mb/s
         Duplex: Full
         Port: Twisted Pair
         PHYAD: 0
         Transceiver: internal
         Auto-negotiation: on
         MDI-X: off (auto)
         Supports Wake-on: umbg
         Wake-on: d
         Current message level: 0x00000007 (7)
                                  drv probe link
         Link detected: yes

Most NICs and switches should not need any adjustment and automatically pick the best port settings, but make sure that your speed and duplex are set to the highest possible values. Ask your provider about that, but in most cases you should be running 1000Mb/s in full-duplex mode (if you have a Gigabit Ethernet connection).
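
When many machines need checking, the two relevant fields can be pulled out of the ethtool output with a small filter. This is a sketch assuming the output layout shown above; link_summary is a hypothetical helper name.

```shell
#!/bin/sh
# link_summary: hypothetical helper; reads "ethtool <iface>" output on
# stdin and prints the negotiated speed and duplex on one line.
link_summary() {
    awk -F': *' '
        /^[[:space:]]*Speed:/  { speed = $2 }
        /^[[:space:]]*Duplex:/ { duplex = $2 }
        END { print speed, duplex }
    '
}

# Usage (as root):
#   ethtool eth0 | link_summary
```

For the interface shown above, this would report 1000Mb/s and Full.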

You can also use ethtool to get/set your ring buffer settings. Bigger ring buffers mean that more data frames can be processed while handling a single interrupt. This gives you better network performance under high traffic, but increases latency. If you want to experiment with them, make sure to do proper benchmarking before and after!

# ethtool -g eth0
Ring parameters for eth0:

Pre-set maximums:
RX:               4096
RX Mini:          0
RX Jumbo:         0
TX:               4096

Current hardware settings:
RX:               256
RX Mini:          0
RX Jumbo:         0
TX:               256

As you can see in the example above, the current buffers are much smaller than the maximum. Increasing them to 512, 1020 or 2040 frames has several times resulted in better throughput, at the cost of latency (which was still at acceptable levels).
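
Ring sizes are changed with “ethtool -G”, for example “ethtool -G eth0 rx 512 tx 512”; the setting is applied at runtime and is not kept across reboots by itself. Before changing anything, it is handy to see how far the current setting is from the hardware limit. The sketch below assumes the “ethtool -g” layout shown above; ring_rx is a hypothetical helper name.

```shell
#!/bin/sh
# ring_rx: hypothetical helper; reads "ethtool -g <iface>" output on stdin
# and prints the current and the maximum RX ring size.
ring_rx() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        # Match only the plain "RX:" line, not "RX Mini:" or "RX Jumbo:".
        $1 == "RX:"                   { rx[section] = $2 }
        END { print rx["cur"], rx["max"] }
    '
}

# Usage (as root):
#   ethtool -g eth0 | ring_rx
```

For the interface shown above, this would print 256 and 4096.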