This is a continuation of a post I started back here. Please read it first before starting below.
Another workaround we can use is Packetization Layer Path MTU Discovery – RFC 4821. The RFC enables a host to mainly acts in one of two ways:
- Use regular PMTUD. If no acknowledgments are received and no ICMP messages are received, start to probe.
- Ignore regular PMTUD and always probe.
Probing is where the host will send a packet with the min MTU configured and then attempt to increase that size. If acknowledgements are received on the larger size, then try increase it again. Option 1 will wait for a timeout so on broken PMTUD paths it starts a bit slow. It will however use regular PMTUD whenever it can so it’s a lot more efficient. Option 2 simple probes all the time. It starts a bit quicker on smaller MTU paths, but the server is also sending smaller packets to ALL paths in the beginning. Much less efficient.
I’ll configure this in Debian and then go through Wireshark to show what’s going on. Add the commands net.ipv4.tcp_mtu_probing = 1 to /etc/sysctl.conf then reload sysctl:
[email protected]:~# sysctl -p net.ipv4.tcp_mtu_probing = 1
Start the transfer and what does Wireshark show us:
After the standard 3-way handshake, the server sends a number of 1514 byte packets. ICMP has been blocked and as such there are no ICMP fragmentation needed messages coming from R2. After 5.3 seconds the server sends a number of 578 byte packets.
These get ACK’d correctly:
0.5 seconds later the server sends a single 1090 byte packets and fill the rest of the window with 578 byte packets. As soon as the ACK for that big packet comes back, the server sends all of its packets at 1090:
A couple of things to note about this setting in Ubuntu 14.04 and Debian 7.6.0:
- The system does not cache the MTU of the path found through PLPMTUD. This does mean that if you have a host making multiple TCP connections to your server over a small MTU path, each one of those are going to need to wait for the timeout.
- There is no net.ipv6.tcp_mtu_probing setting in sysctl.conf. However if you enable this setting for IPv4 then IPv6 has the same behavior as IPv4:
Windows can also be configured for PLPMTUD but I’ll leave it up to the reader to figure out how to do that.
I showed in part 1 that the server will cache an entry if the MTU is lower than the local link. By default, Debian will cache this entry for 10 minutes. This time is adjustable via sysctl.conf:
[email protected]:~# sysctl -a | grep mtu_expires net.ipv4.route.mtu_expires = 600 net.ipv6.route.mtu_expires = 600
As soon as a value is cached, the timer starts. This timer counts down even if there is an existing file transfer. The reason is because paths can change. While the transfer is going on it could move to a path which has no MTU issues. We would want the server to then increase it’s MTU. Doing this too quickly can cause more traffic to drop and so the suggestion is to cache the MTU for 10 whole minutes and then try to increase. I’ve started a file transfer which is ongoing and then checked the cache entry on the server. You can see the timer going down:
The client has then finished downloading and disconnected from the server. At this point the server still keeps that cache entry. This ensures that if the client connects again shortly it will start with an MTU of 1400:
I’ve started a new download within the cache time above and we can see the server immediately starts sending packets with the correct MTU:
What should happen when the cache times out is that the server should try to send a larger MTU packet, up to the local MTU. I don’t see that with Debian though. I started the test with the lower MTU cached on my server. When the cache was about to expire above I started the test again and as expected the session starts with the lower cached MTU. I then changed the MTU between R2 and R5 back up to the regular MTU:
R2(config)#int fa0/1 R2(config-if)#no ip mtu 1400 R2(config-if)#end
The odd thing is, when the cache entry timed out, Debian carried on sending packets with an MTU of 1400 and cached the entry again. That’s not supposed to happen.
I then tried the same test again, this time manually clearing the cache on Debian:
[email protected]:~#ip route flush cache
IPv6 has roughly the same broken behavior. At first the cache is created and starts to count down. I started a transfer when it was about to expire. This time it again stayed at 1400, but the timer jumped into a huge number:
8590471 seconds is roughly 99 days. Not sure if this is a bug or what exactly.
Clearing the IPv6 cache on the other hand had the required effect:
If the MTU matches the outgoing interface, there is no need for the system to cache that entry taking up more resources on the server. Wireshark shows the jump in MTU:
- Blocking the required ICMP packets breaks PMTUD completely.
- There are alternatives to PMTUD, but they are slower initially.
- Test your OS’s behavior. I mainly tested with Debian and I ran into a number of ‘odd’ scenarios. Mainly to do with the cache.