One of IPv6’s features is that routers are no longer supposed to fragment packets. Instead, it’s up to the hosts at either end to work out the path MTU. This differs from IPv4, in which routers along the path could fragment packets themselves. Both IPv4 and IPv6 have a mechanism to work out the path MTU, which is what I’ll go over in this post. Rather than covering each separately, I’ll show the problem being solved and how the two differ when it comes to sending traffic.
When you visit this blog, your browser requests a particular web page from my server. This request is usually quite small. My server needs to respond to that request with the actual data. This includes the images, words, plugins, style-sheets, etc. This data can be quite large. My server needs to break this stream of data into IP packets to send back to you. Each packet requires a few headers, so the most efficient way to send data back to you is to pack the largest amount of data into the smallest number of packets.
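To put a rough number on that overhead, here’s a back-of-the-envelope calculation. The header sizes and the 1 MB response are hypothetical figures for illustration, assuming a 20-byte IPv4 header and a 20-byte TCP header with no options:

```python
# Rough illustration: how many packets does a 1 MB response need at a given MTU?
# Assumes 20-byte IPv4 + 20-byte TCP headers, no options (hypothetical numbers).
def packets_needed(data_bytes, mtu, header_bytes=40):
    payload_per_packet = mtu - header_bytes
    # Ceiling division: the last packet is usually not full.
    return -(-data_bytes // payload_per_packet)

print(packets_needed(1_000_000, 1500))  # 1460-byte payloads -> 685 packets
print(packets_needed(1_000_000, 576))   # a much smaller MTU -> 1866 packets
```

The bigger the usable packet size, the fewer packets (and fewer repeated headers) needed for the same data.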
Between you and my server sits a load of different networks and hardware. There is no way for my server to know the MTU supported by every device along the path. Not only can this path change, but I have many thousands of readers in thousands of different countries. In the topology above, the link between R2 and R4 has an MTU of 1400. None of the hosts are directly connected to that segment, so none of them know the MTU of the entire path.
Path MTU Discovery, RFC 1191 for IPv4 and RFC 1981 for IPv6, does exactly what the name suggests: it finds out the MTU of the path. There are a number of similarities between the two RFCs, but also a few key differences which I’ll dig into.
Note – OS implementations of PMTUD can vary widely. I’ll be showing both Debian Linux server 7.6.0 and Windows Server 2012 in this post.
Both RFCs state that a host should initially assume that the MTU across the entire path matches its first-hop MTU, i.e. the MTU of the link it’s directly connected to. In this case both my Windows and Linux servers have a local MTU of 1500.
The link between R2 and R4 has an IP MTU of 1400. My servers need to figure out the path MTU in order to maximise packet size without fragmentation.
The basic idea is that a source host initially assumes that the PMTU of a path is the (known) MTU of its first hop, and sends all datagrams on that path with the DF bit set. If any of the datagrams are too large to be forwarded without fragmentation by some router along the path, that router will discard them and return ICMP Destination Unreachable messages with a code meaning “fragmentation needed and DF set”. Upon receipt of such a message (henceforth called a “Datagram Too Big” message), the source host reduces its assumed PMTU for the path.
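That loop from RFC 1191 can be sketched as a toy simulation. This is not real networking code, just the logic: the sender starts at its first-hop MTU, and every time a “router” can’t forward the packet it reports its next-hop MTU, which the sender adopts:

```python
# Toy simulation of the PMTUD loop: no sockets, just the algorithm.
def discover_pmtu(first_hop_mtu, link_mtus):
    pmtu = first_hop_mtu
    while True:
        # DF is set, so any link that can't carry a pmtu-sized packet drops it.
        too_small = [m for m in link_mtus if m < pmtu]
        if not too_small:
            return pmtu  # the packet made it end to end
        # The dropping router returns "Datagram Too Big" with its next-hop MTU,
        # and the sender lowers its assumed PMTU to that value.
        pmtu = too_small[0]

# A path like the lab topology: 1500 first hop, a 1400 link in the middle.
print(discover_pmtu(1500, [1500, 1400, 1500]))  # -> 1400
```

Note the sender only ever learns one bottleneck per round trip, so a path with several shrinking links takes several drops to converge.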
In my example, the servers should initially assume that the path MTU is 1500. They should send packets back to the user at this size with the Don’t Fragment (DF) bit set. R2’s link to R4 can’t carry packets that large, so R2 should drop them and return the correct ICMP message to my servers. The servers should then resend those packets with a lower MTU.
I’m going to show Wireshark captures from the server’s point of view. I’ll start with Windows.
The first part is the regular TCP 3-way handshake to set up the session. These packets are very small so are generally not fragmented:
The user then requests a file. The server responds with full-size packets with the DF bit set. Those packets are dropped by R2, which sends back the required ICMP message:
Next, the ICMP message sent from R2. This is an ICMP Type 3 Code 4 message. It states the destination is unreachable and that fragmentation is required. Note it also states the MTU of the next hop. The Windows server can use this value to re-originate its packets with a lower MTU.
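To make the message format concrete, here is a sketch of where that next-hop MTU lives in the wire format. Per RFC 1191, the ICMP header is type, code, checksum, two unused bytes, then a 16-bit Next-Hop MTU field. The checksum is zeroed here purely for illustration:

```python
import struct

# Parse the Next-Hop MTU out of an ICMP "fragmentation needed" header.
# Layout per RFC 1191: type (1B), code (1B), checksum (2B), unused (2B), MTU (2B).
def parse_frag_needed(icmp_header):
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_header[:8])
    if icmp_type == 3 and code == 4:
        return next_hop_mtu
    return None  # some other ICMP message

# A hand-built Type 3 Code 4 header, checksum zeroed for simplicity.
header = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(parse_frag_needed(header))  # -> 1400
```

A 16-bit field caps the advertised MTU at 65535, which is plenty for IPv4.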
RFC 1191 states that a host should cache the lowered PMTU value. It also suggests that this value be cached for 10 minutes, and that the timeout be tunable. You can view the cached value on Windows, but it doesn’t show the timer. Perhaps a reader could let me know?
I’ll now do the same on my Debian server. First part is the 3-way handshake again:
The server starts sending packets with an MTU of 1500:
Which are dropped by R2, with ICMP messages sent back:
The Debian server will cache that entry. Debian does show me the remaining cache time, in this case 584 seconds:
RFC1981 goes over the finer details of how this works with IPv6. The majority of the document is identical to the RFC1191 version.
When the Debian server responds, the packets have a size of 1514 on the wire as expected. Note however that there is no DF bit in IPv6 packets. This is a major difference between IPv4 and IPv6 right here. Routers CANNOT fragment IPv6 packets, and hence there is no reason to state this explicitly in the packet. All IPv6 packets are non-fragmentable by routers in the path. I’ll go over what this means in depth later.
R2 cannot forward this packet and drops it. The message returned by R2 is still an ICMP message, but it’s a bit different to the IPv4 version:
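The IPv6 equivalent is the ICMPv6 “Packet Too Big” message, type 2. Unlike the IPv4 version, the reported MTU is a full 32-bit field immediately after the 4-byte ICMPv6 header (RFC 4443). A sketch, again with the checksum zeroed for illustration:

```python
import struct

# Parse the MTU out of an ICMPv6 Packet Too Big message.
# Layout per RFC 4443: type (1B), code (1B), checksum (2B), MTU (4B).
def parse_packet_too_big(icmpv6_header):
    icmp_type, code, _checksum, mtu = struct.unpack("!BBHI", icmpv6_header[:8])
    if icmp_type == 2:
        return mtu
    return None  # some other ICMPv6 message

header = struct.pack("!BBHI", 2, 0, 0, 1400)
print(parse_packet_too_big(header))  # -> 1400
```

The wider MTU field is one of the small structural differences between the two protocols’ PMTUD messages.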
Windows Server 2012 has identical behaviour. To show the cache, simply view the IPv6 destination cache and you’re good to go.
So what could possibly go wrong? The above all looks good and works in the lab. The biggest issue is that both mechanisms rely on those ICMP messages getting back to the sending host. There are a load of badly configured firewalls and ACLs out there dropping more ICMP than they should. Some people even drop ALL ICMP. There is another issue that I’ll go over in another blog post in the near future.
In the above examples, if those ICMP messages don’t get back, the sending host will not adjust its MTU. If it continues to send large packets, the router with the smaller MTU will drop them. All that traffic is blackholed. Smaller packets, like requests, will still get through. Ping will even work if echo-requests and echo-replies have been let through. You might even see the beginnings of a web page, but the big content will never load.
On R1’s fa0/1 interface I’ll create this bad access list:
R1#sh ip access-lists
Extended IP access list BLOCK-ICMP
    10 permit icmp any any echo
    20 permit icmp any any echo-reply
    30 deny icmp any any
    40 permit ip any any
But try to download the file:
The initial 3-way handshake works fine, but nothing else happens. The Debian server is sending those packets, R2 is dropping them and informing the sender, but R1 drops those ICMP messages. You’ve now got a black hole. The same thing happens with IPv6, though of course the packet dropped is the Packet Too Big message.
The best thing to do is fix the problem. Unfortunately that’s not always possible. There are a few things that can be done to work through the problem of dropped ICMP packets.
If you know the MTU value further down the path, you can use TCP MSS clamping. This causes the router to intercept TCP SYN packets and rewrite the TCP MSS option. You need to take the size of the IP and TCP headers into account.
R1#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
R1(config)#int fa1/1
R1(config-if)#ip tcp adjust-mss 1360
R1(config-if)#end
The problem with this is that it’s a burden on the router it’s configured on. Your router might not even support the option. It also affects ALL TCP traffic going through that router. TCP clamping can work well for VPN tunnels, but it’s not a very scalable solution.
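The arithmetic behind that 1360 value is simple: the MSS is what’s left of the path MTU after the IP and TCP headers. A quick sketch, assuming a 1400-byte path MTU and no header options:

```python
# MSS = path MTU minus the IP and TCP headers (no options assumed).
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20

def clamped_mss(path_mtu, ip_header_bytes, tcp_header_bytes=TCP_HEADER):
    return path_mtu - ip_header_bytes - tcp_header_bytes

print(clamped_mss(1400, IPV4_HEADER))  # -> 1360, the value in the config above
print(clamped_mss(1400, IPV6_HEADER))  # -> 1340 for IPv6's larger header
```

Note that the same path MTU gives a smaller MSS for IPv6 because of its 40-byte base header.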
Another workaround can be to get the router to disregard the DF bit and just let the routers fragment the packets:
route-map CLEAR-DF permit 10
 set ip df 0
!
interface FastEthernet1/1
 ip address 192.168.4.1 255.255.255.0
 ip router isis
 ip policy route-map CLEAR-DF
 ipv6 address 2001:DB8:10:14::1/64
 ipv6 router isis
The problem with this is that you’re again placing a burden on the router. It’s also not at all efficient. Some firewalls block fragments, and some routers might simply drop fragmented packets.
The biggest problem with this is that there is no DF bit to clear in IPv6. IPv6 packets will not be fragmented by routers; fragmentation has to be done by the sending host.
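When an IPv6 host does fragment, it inserts an 8-byte Fragment extension header into each piece, and every fragment’s payload except the last must be a multiple of 8 bytes. A rough count of the fragments needed, assuming only the 40-byte base header plus the Fragment header:

```python
# Rough count of IPv6 fragments for a payload at a given path MTU.
# Assumes a 40-byte IPv6 header + 8-byte Fragment header per fragment,
# with fragment payloads aligned down to a multiple of 8 bytes.
def ipv6_fragments(payload_bytes, path_mtu):
    per_frag = (path_mtu - 40 - 8) // 8 * 8
    return -(-payload_bytes // per_frag)  # ceiling division

print(ipv6_fragments(4000, 1400))  # -> 3 fragments over a 1400-byte path MTU
```

This is the work that routers used to do in IPv4 being pushed back onto the host, which is exactly why working PMTUD matters even more in IPv6.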
End of Part One
There is simply too much to cover in a single post. I’ll end this post here. Part two will be coming soon!