Debian/Ubuntu PMTUD & uRPF

I originally started my PMTUD posts using Ubuntu 14.04. Halfway through the post I simply could not get Ubuntu to change its MTU on receipt of ICMP fragmentation needed messages. I then tried Debian and it worked. Windows also had no issues changing its MTU.

Wanting to finish off the post, I switched to Debian and decided to investigate the fault later.

Let’s remind ourselves of the original topology:
[Image: pmtu 11]
Swap out Debian for Ubuntu in the above image.

When I initially started to test, I dropped the MTU between R1 and R2 to 1400. The link between R2 and R4 was kept at 1500. If the user requested a file from the server at this point, Ubuntu would attempt to send at 1500 and get its packet dropped at R1. R1 would send a Fragmentation Needed packet back to the Ubuntu server, which would adjust its MTU and then send at 1400.

When I changed the MTU between R1 and R2 back up to 1500 and dropped R2-R4 down to 1400, it no longer worked. Debian and Windows did work. I ran tcpdump on Ubuntu and confirmed that it was definitely getting Fragmentation Needed packets. Ubuntu was only acting on Fragmentation Needed packets if they came from its default gateway, R1. Any router further along in the path was getting its ICMP packets ignored.

In order to understand what the problem is I need to show more about the topology. While the above diagram shows how things are connected for the most part, it is missing a couple of things. All the devices are running inside VirtualBox linked to GNS3. eth1 on each server is connected to the above topology, while eth0 is connected via NAT to my host PC so I could install software:
[Image: PMTUD uRPF]
Each device had a static route to 192.168.0.0/16 to go out eth1 while their default route was out eth0. Some of you may be sensing what the issue is already…

The point to point links between the virtual routers are using the 10.0.0.0/8 space.

If Ubuntu received an ICMP packet from 192.168.4.1, its local default gateway on R1, there were no issues. If it received a packet from R2 or R4’s local interfaces, the packet was dropped. Debian and Windows had no problems, even though they were configured the same way.

sysctl.conf

I’ve touched on sysctl.conf before the PMTUD posts, but there is an important difference in the defaults of Ubuntu and Debian. Take a look at this.
Debian:
[Image: Screen Shot 2014 09 02 at 9.11.50 am]
Ubuntu:
[Image: Screen Shot 2014 09 02 at 9.12.01 am]

uRPF

Ubuntu has Unicast Reverse Path Forwarding on by default. Debian has it off by default. In sysctl.conf on both machines, the required configuration setting is commented out:
[Image: Screen Shot 2014 09 02 at 9.15.44 am]
R2 was originating its ICMP packets from its local interface, 10.0.12.2 in my example. Ubuntu did receive that packet, but it failed the RPF check and so was ignored. To confirm I tested this in two different ways:

  • Add a static route to 10.0.0.0/8 out eth1
  • Disable uRPF check on Ubuntu

Each test individually allowed the original PMTUD to work. What’s odd is that the sysctl.conf file in Ubuntu says that you need to uncomment the lines to turn on uRPF, but it’s on by default. Uncommenting the lines and setting the value to 1 is the same as leaving them commented. In Debian the default is to disable uRPF. In that distro you would need to uncomment the uRPF lines and set the value to 1 to turn the feature on.
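
To make the failure concrete, here is a minimal Python sketch of a strict reverse-path check. It is only a model of the idea, not the kernel's implementation. The routing table, the function names and the interface labels are assumptions modelled on the lab described above.

```python
import ipaddress

# Hypothetical routing table modelled on the lab: (prefix, outgoing interface).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),       # default route via NAT
    (ipaddress.ip_network("192.168.0.0/16"), "eth1"),  # static route into the lab
]


def best_route_interface(routes, address):
    """Return the interface of the longest-prefix match for address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [(net.prefixlen, ifname) for net, ifname in routes if addr in net]
    return max(matches)[1] if matches else None


def strict_rpf_accepts(routes, source, arrival_interface):
    """Strict RPF: accept only if the route back to source uses the arrival interface."""
    return best_route_interface(routes, source) == arrival_interface


# ICMP from R1 (192.168.4.1) arrives on eth1 and the route back is also eth1.
print(strict_rpf_accepts(ROUTES, "192.168.4.1", "eth1"))
# ICMP from R2 (10.0.12.2) arrives on eth1, but the route back is the default via eth0.
print(strict_rpf_accepts(ROUTES, "10.0.12.2", "eth1"))
# The first test above: add a static route for 10.0.0.0/8 out eth1.
FIXED = ROUTES + [(ipaddress.ip_network("10.0.0.0/8"), "eth1")]
print(strict_rpf_accepts(FIXED, "10.0.12.2", "eth1"))
```

In this model the packet from R2 only passes once the route back to 10.0.0.0/8 points out eth1, which matches the first of the two tests.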

Conclusions

  • If a server is multi-homed, PMTUD could break if the ICMP message arrives on an interface that the server is not expecting.
  • If you do have a multi-homed server, it would probably be best to turn off uRPF.

Fundamentals – PMTUD – IPv4 vs IPv6 – Part 2 of 2

This is a continuation of a post I started back here. Please read it first before starting below.

RFC 4821

Another workaround we can use is Packetization Layer Path MTU Discovery – RFC 4821. The RFC lets a host act in one of two main ways:

  • Use regular PMTUD. If no acknowledgments are received and no ICMP messages are received, start to probe.
  • Ignore regular PMTUD and always probe.

Probing is where the host sends a packet at the minimum configured MTU and then attempts to increase that size. If acknowledgements are received at the larger size, it tries to increase it again. Option 1 waits for a timeout, so on broken PMTUD paths it starts a bit slowly. It will however use regular PMTUD whenever it can, so it’s a lot more efficient. Option 2 simply probes all the time. It starts a bit quicker on smaller MTU paths, but the server also sends smaller packets on ALL paths in the beginning, which is much less efficient.

I’ll configure this in Debian and then go through Wireshark to show what’s going on. Add the line net.ipv4.tcp_mtu_probing = 1 to /etc/sysctl.conf (a value of 1 corresponds to option 1 above, a value of 2 to option 2) then reload sysctl:

root@debian1:~# sysctl -p
net.ipv4.tcp_mtu_probing = 1

Start the transfer and what does Wireshark show us:
[Image: Screen Shot 2014 08 29 at 3.43.59 pm]
After the standard 3-way handshake, the server sends a number of 1514 byte packets. ICMP has been blocked and as such there are no ICMP fragmentation needed messages coming from R2. After 5.3 seconds the server sends a number of 578 byte packets.
[Image: Screen Shot 2014 08 29 at 3.43.24 pm]
These get ACK’d correctly:
[Image: Screen Shot 2014 08 29 at 3.44.59 pm]
0.5 seconds later the server sends a single 1090 byte packet and fills the rest of the window with 578 byte packets. As soon as the ACK for that big packet comes back, the server sends all of its packets at 1090:
[Image: Screen Shot 2014 08 29 at 3.47.31 pm]
[Image: Screen Shot 2014 08 29 at 3.48.01 pm]
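
One way to read the 578 and 1090 byte frame sizes is to subtract the headers, as in the rough Python sketch below. It assumes IPv4 with no IP options and a 12 byte TCP timestamp option on every data segment. The captures are not reproduced here, so treat that option size as an assumption rather than something the post confirms.

```python
ETHERNET_HEADER = 14   # bytes, as shown by Wireshark (no FCS)
IPV4_HEADER = 20       # bytes, no IP options
TCP_HEADER = 20        # bytes, base TCP header
TCP_TIMESTAMPS = 12    # bytes, ASSUMPTION: timestamp option on data segments

OVERHEAD = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER + TCP_TIMESTAMPS


def frame_size(tcp_payload):
    """On-the-wire frame size for a given TCP payload size."""
    return tcp_payload + OVERHEAD


def payload_size(frame):
    """TCP payload carried by a frame of the given size."""
    return frame - OVERHEAD


for frame in (578, 1090, 1514):
    print(frame, "byte frame carries", payload_size(frame), "bytes of TCP payload")
```

Under those assumptions the frames carry 512, 1024 and 1448 bytes of payload, so the probe looks like a doubling of the segment size from 512 to 1024. That is an interpretation of the numbers, not a statement about the kernel's exact probing algorithm.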

A couple of things to note about this setting in Ubuntu 14.04 and Debian 7.6.0:

  1. The system does not cache the MTU of the path found through PLPMTUD. This means that if you have a host making multiple TCP connections to your server over a small MTU path, each one of those is going to need to wait for the timeout.
  2. There is no net.ipv6.tcp_mtu_probing setting in sysctl.conf. However if you enable this setting for IPv4 then IPv6 has the same behavior as IPv4:

[Image: Screen Shot 2014 08 29 at 3.54.50 pm]

Windows can also be configured for PLPMTUD but I’ll leave it up to the reader to figure out how to do that.

PMTUD Cache

I showed in part 1 that the server will cache an entry if the MTU is lower than the local link. By default, Debian will cache this entry for 10 minutes. This time is adjustable via sysctl.conf:

root@debian1:~# sysctl -a | grep mtu_expires
net.ipv4.route.mtu_expires = 600
net.ipv6.route.mtu_expires = 600

As soon as a value is cached, the timer starts. This timer counts down even if there is an existing file transfer. The reason is that paths can change. While the transfer is going on it could move to a path which has no MTU issues. We would want the server to then increase its MTU. Doing this too quickly can cause more traffic to drop and so the suggestion is to cache the MTU for 10 whole minutes and then try to increase. I’ve started a file transfer which is ongoing and then checked the cache entry on the server. You can see the timer going down:
[Image: Screen Shot 2014 09 01 at 1.22.23 pm]
The client has then finished downloading and disconnected from the server. At this point the server still keeps that cache entry. This ensures that if the client connects again shortly it will start with an MTU of 1400:
[Image: Screen Shot 2014 09 01 at 1.30.12 pm]
I’ve started a new download within the cache time above and we can see the server immediately starts sending packets with the correct MTU:
[Image: Screen Shot 2014 09 01 at 1.37.23 pm]

What should happen when the cache times out is that the server should try to send a larger MTU packet, up to the local MTU. I don’t see that with Debian though. I started the test with the lower MTU cached on my server. When the cache was about to expire above I started the test again and as expected the session starts with the lower cached MTU. I then changed the MTU between R2 and R5 back up to the regular MTU:

R2(config)#int fa0/1
R2(config-if)#no ip mtu 1400
R2(config-if)#end

The odd thing is, when the cache entry timed out, Debian carried on sending packets with an MTU of 1400 and cached the entry again. That’s not supposed to happen.

I then tried the same test again, this time manually clearing the cache on Debian:

root@debian1:~# ip route flush cache

This time the server immediately started to send larger packets:
[Image: Screen Shot 2014 09 01 at 2.16.48 pm]

IPv6 has roughly the same broken behavior. At first the cache entry is created and starts to count down. I started a transfer when it was about to expire. This time it again stayed at 1400, but the timer jumped to a huge number:
[Image: Screen Shot 2014 09 01 at 3.59.22 pm]
8590471 seconds is roughly 99 days. Not sure if this is a bug or what exactly.
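
The conversion behind that figure can be checked quickly. This is just arithmetic on the two timer values that appear in this post. The helper name is made up for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_to_days(seconds):
    """Convert a timer value in seconds to days."""
    return seconds / SECONDS_PER_DAY


print(round(seconds_to_days(8590471), 1))  # the odd IPv6 timer value, in days
print(600 / 60)                            # the default mtu_expires, in minutes
```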

Clearing the IPv6 cache on the other hand had the required effect:
[Image: Screen Shot 2014 09 01 at 4.04.20 pm]
If the MTU matches the outgoing interface, there is no need for the system to cache that entry, which would only take up more resources on the server. Wireshark shows the jump in MTU:
[Image: Screen Shot 2014 09 01 at 4.05.32 pm]

Conclusions

  • Blocking the required ICMP packets breaks PMTUD completely.
  • There are alternatives to PMTUD, but they are slower initially.
  • Test your OS’s behavior. I mainly tested with Debian and ran into a number of ‘odd’ scenarios, mainly to do with the cache.

Fundamentals – PMTUD – IPv4 & IPv6 – Part 1 of 2

One of IPv6’s features is that routers are no longer supposed to fragment packets. Rather it’s up to the hosts on either end to work out the path MTU. This is different from IPv4, in which the routers along the path could fragment the packet. Both IPv4 and IPv6 have a mechanism to work out the path MTU, which is what I’ll go over in this post. Instead of going over each separately, I’ll show what problem is being solved and how the two differ when it comes to sending traffic.

I’ll be using the following topology in this post:
[Image: pmtu 11]

The problem

When you visit this blog, your browser is requesting a particular web page from my server. This request is usually quite small. My server needs to respond to that packet with some actual data. This includes the images, words, plugins, style-sheets, etc. This data can be quite large. My server needs to break down this stream of data into IP packets to send back to you. Each packet requires a few headers, and so the most efficient way to send data back to you is to put the largest amount of data in the smallest number of packets.

Between you and my server sits a load of different networks and hardware. There is no way for my server to know the maximum MTU supported by all those devices along the path. Not only can this path change, but I have many thousands of readers in many different countries. In the topology above, the link between R2 and R4 has an MTU of 1400. None of the hosts are directly connected to that segment and so none of them know the MTU of the entire path.
[Image: pmtu 2]

PMTUD

Path MTU Discovery, RFC1191 for IPv4 and RFC1981 for IPv6, does exactly what the name suggests: it finds out the MTU of the path. There are a number of similarities between the two RFCs, but also a few key differences which I’ll dig into.

Note – OS implementations of PMTUD can vary widely. I’ll be showing both Debian Linux server 7.6.0 and Windows Server 2012 in this post.

Both RFCs state that hosts should always assume first that the MTU across the entire path matches the first hop MTU, i.e. the servers should assume that the path MTU matches the MTU of the link they are connected to. In this case both my Windows and Linux servers have a local MTU of 1500.
[Image: pmtu 3]

The link between R2 and R4 has an IP MTU of 1400. My servers need to figure out the path MTU in order to maximise the packet size without fragmentation.

  • IPv4
  • RFC1191 states:

    The basic idea is that a source host initially assumes that the PMTU of a path is the (known) MTU of its first hop, and sends all datagrams on that path with the DF bit set. If any of the datagrams are too large to be forwarded without fragmentation by some router along the path, that router will discard them and return ICMP Destination Unreachable messages with a code meaning “fragmentation needed and DF set” [7]. Upon receipt of such a message (henceforth called a “Datagram Too Big” message), the source host reduces its assumed PMTU for the path.

    In my example, the servers should assume that the path MTU is 1500. They should send packets back to the user using this MTU and setting the Do Not Fragment bit. R2’s link to R4 is not big enough and so R2 should drop the packet and return the correct ICMP message back to my servers. Those servers should then send those packets again with a lower MTU.
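
The exchange just described can be modelled in a few lines of Python. This is a toy simulation under simplifying assumptions: every router always returns the ICMP message, and the list of hops is a made-up representation of the lab path. It is not how any particular OS implements PMTUD.

```python
def discover_path_mtu(link_mtus):
    """Simulate classic PMTUD for a sender whose path crosses link_mtus in order.

    Returns (path_mtu, events). The sender starts from its first-hop MTU and
    lowers its estimate each time a router reports a smaller next-hop MTU.
    """
    assumed = link_mtus[0]  # initial assumption: path MTU equals first-hop MTU
    events = []
    while True:
        for hop, mtu in enumerate(link_mtus):
            if assumed > mtu:
                # The router in front of this link drops the DF packet and
                # reports the next-hop MTU back to the sender.
                events.append(("dropped", hop, assumed, mtu))
                assumed = mtu
                break
        else:
            events.append(("delivered", assumed))
            return assumed, events


# Hypothetical hop list: server-R1, R1-R2, R2-R4, R4-client.
path_mtu, events = discover_path_mtu([1500, 1500, 1400, 1500])
print(path_mtu)
for event in events:
    print(event)
```

With the lab's values the first 1500 byte packet is dropped at the third link and the retry at 1400 gets through.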

    I’m going to show a Wireshark capture from the server’s point of view. I’ll start with Windows.

    The first part is the regular TCP 3-way handshake to set up the session. These packets are very small so are generally not fragmented:
    [Image: Screen Shot 2014 08 25 at 12.37.40]
    The user then requests a file. The server responds with full size packets with the DF bit set. Those packets are dropped by R2, which sends back the required ICMP message:
    [Image: Screen Shot 2014 08 25 at 12.39.49]

    Dig a bit deeper into those packets. First the full size packet from the server. Note the DF-bit has been set:
    [Image: Screen Shot 2014 08 25 at 12.43.49]

    Second, the ICMP message sent from R2. This is an ICMP Type 3 Code 4 message. It states the destination is unreachable and that fragmentation is required. Note it also states the MTU of the next-hop. The Windows server can use this value to re-originate its packets with a lower MTU.
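
For readers who want to see where that next-hop MTU lives in the message, here is a small Python sketch that decodes the first 8 bytes of an ICMPv4 header using the RFC 1191 layout. The sample bytes are hand-built for illustration, with the checksum left at zero. They are not taken from the capture.

```python
import struct


def parse_frag_needed(icmp_bytes):
    """Return the next-hop MTU from an ICMPv4 'fragmentation needed' message.

    Returns None if the bytes are too short or are a different ICMP message.
    Layout (RFC 1191): type, code, checksum, 16 unused bits, 16-bit next-hop MTU.
    """
    if len(icmp_bytes) < 8:
        return None
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_bytes[:8]
    )
    if icmp_type == 3 and code == 4:
        return next_hop_mtu
    return None


# Hand-built sample header: type 3, code 4, next-hop MTU 1400.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(parse_frag_needed(sample))
```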

    All the rest of the packets in the capture then have the lower MTU set. Note that Wireshark shows the size of the whole Ethernet frame, which includes the 14 byte Ethernet header, hence the value of 1414:
    [Image: Screen Shot 2014 08 25 at 12.49.11]

    RFC1191 states that a server should cache a lower MTU value. It’s also suggested that this value is cached for 10 minutes, and should be tweakable. You can view the cached value on Windows, but it doesn’t show the timer. Perhaps a reader could let me know?
    [Image: Screen Shot 2014 08 25 at 12.53.53]

    I’ll now do the same on my Debian server. First part is the 3-way handshake again:
    [Image: Screen Shot 2014 08 26 at 1.27.02 pm]
    The server starts sending packets with an MTU of 1500:
    [Image: Screen Shot 2014 08 26 at 1.28.48 pm]
    Which are dropped by R2, with ICMP messages sent back:
    [Image: Screen Shot 2014 08 26 at 1.29.52 pm]
    The Debian server will cache that entry. Debian does show me the remaining cache time, in this case 584 seconds:
    [Image: Screen Shot 2014 08 26 at 1.32.23 pm]

  • IPv6
  • RFC1981 goes over the finer details of how this works with IPv6. The majority of the document is identical to the RFC1191 version.

    When the Debian server responds, the packets have a size of 1514 on the wire as expected. Note however that there is no DF bit in IPv6 packets. This is a major difference between IPv4 and IPv6 right here. Routers CANNOT fragment IPv6 packets and hence there is no reason to explicitly state this in the packet. All IPv6 packets are non-fragmentable by routers in the path. I’ll go over what this means in depth later.
    [Image: Screen Shot 2014 08 27 at 8.06.39 am]
    R2 cannot forward this packet and drops it. The message returned by R2 is still an ICMP message, but it’s a bit different to the IPv4 version:
    [Image: Screen Shot 2014 08 27 at 8.10.56 am]

    This time the message is ‘Packet too big’ – Very easy to figure out what that means. The ICMP message will contain the MTU of the next-hop as expected:
    [Image: Screen Shot 2014 08 27 at 8.14.02 am]

    The server will act on this message, cache the result, then send packets up to the required MTU:
    [Image: Screen Shot 2014 08 27 at 8.17.30 am]
    [Image: Screen Shot 2014 08 27 at 8.18.29 am]

    Windows Server 2012 has identical behaviour. To show the cache, simply view the IPv6 destination cache (netsh interface ipv6 show destinationcache) and you’re good to go.

    Problems

    So what could possibly go wrong? The above all looks good and works in the lab. The biggest issue is that both require those ICMP messages to come back to the sending host. There are a load of badly configured firewalls and ACLs out there dropping more ICMP than they are supposed to. Some people even drop ALL ICMP. There is another issue that I’ll go over in another blog post in the near future.

    In the above examples, if those ICMP messages don’t get back, the sending host will not adjust its MTU. If it continues to send large packets, the router with a smaller MTU will drop them. All that traffic is blackholed. Smaller packets like requests will get through. Ping will even get through if echo-requests and echo-replies have been let through. You might even be able to see the beginnings of a web page, but the big content will not load.

    On R1’s fa0/1 interface I’ll create this bad access list:

    R1#sh ip access-lists
    Extended IP access list BLOCK-ICMP
        10 permit icmp any any echo
        20 permit icmp any any echo-reply
        30 deny icmp any any
        40 permit ip any any
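
To see why this list breaks PMTUD while ping still works, here is a rough first-match model of it in Python. The tuple format and the function name are invented for this sketch, and ICMP codes are ignored to keep it short. Real IOS matching has more fields than this.

```python
# Each entry: (action, protocol, icmp_type). An icmp_type of None matches any type.
ACL = [
    ("permit", "icmp", 8),     # 10 permit icmp any any echo
    ("permit", "icmp", 0),     # 20 permit icmp any any echo-reply
    ("deny", "icmp", None),    # 30 deny icmp any any
    ("permit", "ip", None),    # 40 permit ip any any
]


def evaluate(acl, protocol, icmp_type=None):
    """Return the action of the first matching entry (implicit deny at the end)."""
    for action, proto, wanted_type in acl:
        if proto != "ip" and proto != protocol:
            continue  # entry is for a different protocol
        if proto == "icmp" and wanted_type is not None and wanted_type != icmp_type:
            continue  # ICMP entry for a different message type
        return action
    return "deny"


print(evaluate(ACL, "icmp", 8))   # echo request
print(evaluate(ACL, "icmp", 0))   # echo reply
print(evaluate(ACL, "icmp", 3))   # destination unreachable, including code 4
print(evaluate(ACL, "tcp"))       # ordinary TCP traffic
```

In this model echo, echo-reply and TCP are permitted, while the type 3 message that carries "fragmentation needed" hits the deny on line 30.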

    From the client I can ping the host:
    [Image: Screen Shot 2014 08 27 at 8.31.41 am]
    I can even open a text-based page from the server:
    [Image: Screen Shot 2014 08 27 at 8.32.30 am]

    But try to download the file:
    [Image: Screen Shot 2014 08 27 at 8.33.39 am]
    The initial 3-way handshake works fine, but nothing else happens. The Debian server is sending those packets and R2 is dropping them and informing the sender, but R1 drops those ICMP messages. You’ve now got a black hole. The same thing happens with IPv6, though of course the message dropped is the Packet Too Big message.

    Workarounds

    The best thing to do is fix the problem. Unfortunately that’s not always possible. There are a few things that can be done to work through the problem of dropped ICMP packets.
    If you know the MTU value further down the line, you can use TCP MSS clamping. This causes the router to intercept TCP SYN packets and rewrite the TCP MSS option. When choosing the value you need to take into account the size of the headers (IP, TCP and any tunnel encapsulation).

    R1#conf t
    Enter configuration commands, one per line.  End with CNTL/Z.
    R1(config)#int fa1/1
    R1(config-if)#ip tcp adjust-mss 1360
    R1(config-if)#end

    Note how the MSS value has been changed to 1360:
    [Image: Screen Shot 2014 08 28 at 1.46.58 pm]
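
The 1360 value comes from simple header arithmetic, sketched below in Python. It assumes base headers only (no IP options, no IPv6 extension headers), and the function name is made up for the example.

```python
IPV4_HEADER = 20  # bytes, no options
IPV6_HEADER = 40  # bytes, no extension headers
TCP_HEADER = 20   # bytes, base header


def clamp_mss(path_mtu, ip_header):
    """Largest TCP MSS that fits in path_mtu with the given IP header size."""
    return path_mtu - ip_header - TCP_HEADER


print(clamp_mss(1400, IPV4_HEADER))  # value for IPv4 over the 1400 byte link
print(clamp_mss(1400, IPV6_HEADER))  # same arithmetic for IPv6
```

The IPv4 result is the 1360 configured above. The same arithmetic for IPv6 gives 1340 because of the larger IPv6 header, so a clamp chosen for IPv4 may need to be lower if IPv6 traffic crosses the same link.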

    I’ve tested with IOS 15.2(4)S2 and it also works with IPv6:
    [Image: Screen Shot 2014 08 28 at 1.54.57 pm]

    The problem with this is that it’s a burden on the router configured. Your router might not even support this option. This also affects ALL TCP traffic going through that router. TCP clamping can work well for VPN tunnels, but it’s not a very scalable solution.

    Another workaround can be to get the router to disregard the DF bit and just let the routers fragment the packets:

    route-map CLEAR-DF permit 10
     set ip df 0
    !
    interface FastEthernet1/1
     ip address 192.168.4.1 255.255.255.0
     ip router isis
     ip policy route-map CLEAR-DF
     ipv6 address 2001:DB8:10:14::1/64
     ipv6 router isis

    The problem with this is that you’re placing burden on the router again. It’s also not at all efficient. Some firewalls also block fragments. Some routers might just drop fragmented packets.
    The biggest problem with this is that there is no DF bit to clear in IPv6. IPv6 packets will not be fragmented by routers. Fragmentation has to be done by the sending host.

    End of Part One

    There is simply too much to cover in a single post. I’ll end this post here. Part two will be coming soon!

    Demystifying the IS-IS database

    I’ve gone over the OSPFv2 and OSPFv3 databases in depth before. Now is the time for IS-IS. As always, I’ll start from a basic two router set up and add devices to the topology.

    Basic LSPs

    In OSPF we use the term LSA, Link-State Advertisement. In IS-IS we use the term LSP, Link-State PDU, which expands further into Link-State Protocol Data Unit. Not to be confused with Label Switched Paths.

    This is the topology we’ll start with:
    [Image: IS IS 1]
    Like OSPF, IS-IS will treat Ethernet links as broadcast by default. In OSPF a DR and BDR will be elected. In IS-IS a single DIS (Designated Intermediate System) is elected with no backup DIS. This DIS election is also pre-emptive, unlike OSPF. The DIS will originate an additional LSP representing the broadcast segment. This means I should have three LSPs in the database currently:

    RP/0/0/CPU0:XR1#show isis database
    Tue Aug 12 17:34:21.594 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.00-00           * 0x00000003   0x8577        736             0/0/0
    XR1.01-00             0x00000002   0x1fba        931             0/0/0
    XR2.00-00             0x00000005   0x856b        806             0/0/0
    
     Total Level-2 LSP count: 3     Local Level-2 LSP count: 1

    XR2 has a single LSP while XR1 has two. The XR1.01 LSP is the DIS LSP. Dig deeper into the LSPs to see their current content:
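
The LSPID column encodes this. The part after the dot is the pseudonode ID (non-zero for a DIS LSP) and the part after the hyphen is the fragment number. A small Python sketch for splitting the IDs as they are displayed here follows. The class and function names are illustrative only.

```python
from typing import NamedTuple


class LspId(NamedTuple):
    system: str      # hostname or system ID as displayed
    pseudonode: int  # 0 for a router's own LSP, non-zero for a DIS LSP
    fragment: int    # LSP fragment number


def parse_lsp_id(text):
    """Split an LSPID such as 'XR1.01-00' into its three parts."""
    head, fragment = text.rsplit("-", 1)
    system, pseudonode = head.rsplit(".", 1)
    return LspId(system, int(pseudonode, 16), int(fragment, 16))


def is_pseudonode_lsp(text):
    """True if the LSPID belongs to a DIS (pseudonode) LSP."""
    return parse_lsp_id(text).pseudonode != 0


for lsp in ("XR1.00-00", "XR1.01-00", "XR2.00-00"):
    print(lsp, parse_lsp_id(lsp), is_pseudonode_lsp(lsp))
```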

    RP/0/0/CPU0:XR1#show isis database XR1.00-00 detail
    Tue Aug 12 17:38:23.307 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.00-00           * 0x00000003   0x8577        494             0/0/0
      Area Address: 49.0001
      NLPID:        0xcc
      Hostname:     XR1
      IP Address:   1.1.1.1
      Metric: 10         IS XR1.01
      Metric: 10         IP 1.1.1.1/32
      Metric: 10         IP 10.0.12.0/24

    XR1 has originated an LSP stating which area it’s in and its hostname. Notice the NLPID value. NLPID stands for Network Layer Protocol IDentifier, and the value 0xcc means IPv4. Further down, the LSP contains an IS entry for XR1.01, the DIS LSP that XR1 itself originates, plus two IP prefixes, each with its own metric. I’ll get onto the ATT/P/OL bits later so ignore those for now.

    It’s important to note that an LSP is made up of several TLVs. On the wire, multiple TLVs are grouped together in a single LSP. If an LSP grows too large, IS-IS splits it into multiple LSP fragments.
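
As a rough illustration of the TLV idea, the Python sketch below walks a byte string of type/length/value triplets. The sample bytes are hand-built, not captured from the lab. TLV 129 (Protocols Supported) and TLV 137 (Dynamic Hostname) are standard IS-IS TLV types, and 0x8e is the NLPID for IPv6. Everything else, including the function names, is an assumption for the example.

```python
NLPID_NAMES = {0xCC: "IPv4", 0x8E: "IPv6"}


def parse_tlvs(data):
    """Yield (type, value) pairs from a byte string of IS-IS TLVs.

    Each TLV is one type byte, one length byte, then 'length' bytes of value.
    """
    offset = 0
    while offset + 2 <= len(data):
        tlv_type, length = data[offset], data[offset + 1]
        value = data[offset + 2: offset + 2 + length]
        if len(value) < length:
            raise ValueError("truncated TLV")
        yield tlv_type, value
        offset += 2 + length


def describe(tlv_type, value):
    """Decode the two TLV types used in this sketch; hex-dump anything else."""
    if tlv_type == 129:  # Protocols Supported: one NLPID per byte
        return [NLPID_NAMES.get(b, hex(b)) for b in value]
    if tlv_type == 137:  # Dynamic Hostname
        return value.decode("ascii")
    return value.hex()


# Hand-built sample: Protocols Supported (IPv4, IPv6) followed by hostname XR1.
sample = bytes([129, 2, 0xCC, 0x8E]) + bytes([137, 3]) + b"XR1"
for tlv_type, value in parse_tlvs(sample):
    print(tlv_type, describe(tlv_type, value))
```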

    As XR1 is the DIS, there is a separate DIS LSP, let’s take a look at that:

    RP/0/0/CPU0:XR1#show isis database XR1.01-00 detail
    Tue Aug 12 17:43:00.448 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.01-00             0x00000003   0x1dbb        1161            0/0/0
      Metric: 0          IS XR1.00
      Metric: 0          IS XR2.00

    The DIS LSP advertises all the ISs that are on the segment on which the DIS sits.

    If I change the segment to point-to-point, this removes the need for a DIS and as such there will be no DIS LSP.

    router isis 1
    !
     interface GigabitEthernet0/0/0/1
      point-to-point
    
    RP/0/0/CPU0:XR1#show isis database
    Tue Aug 12 18:46:50.566 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.00-00           * 0x0000000b   0x7480        674             0/0/0
    XR2.00-00             0x0000000d   0x5297        543             0/0/0
    
     Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

    Externals

    I’m going to add another loopback interface on XR1 and redistribute that loopback into IS-IS. This will make the route external:

    interface Loopback100
     ipv4 address 100.100.100.100 255.255.255.255
    !
    prefix-set LOOPBACK100
      100.100.100.100/32
    end-set
    !
    route-policy RP-100
      if destination in LOOPBACK100 then
        done
      else
        drop
      endif
    end-policy
    !
    router isis 1
     address-family ipv4 unicast
      redistribute connected level-2 route-policy RP-100

    As I mentioned above, IS-IS has separate TLVs that make up the LSP. Therefore there is still only a single LSP from XR1:

    RP/0/0/CPU0:XR2#sh isis database
    Tue Aug 12 19:03:31.569 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.00-00             0x0000000d   0x6be5        1043            0/0/0
    XR2.00-00           * 0x00000010   0x9c8f        1094            0/0/0
    
     Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

    The external route can be seen in the detailed output under that LSP:

    RP/0/0/CPU0:XR2#sh isis database XR1.00-00 detail
    Tue Aug 12 19:03:58.637 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.00-00             0x0000000d   0x6be5        1016            0/0/0
      Area Address: 49.0001
      NLPID:        0xcc
      Hostname:     XR1
      IP Address:   1.1.1.1
      Metric: 10         IS XR2.00
      Metric: 10         IP 1.1.1.1/32
      Metric: 10         IP 10.0.12.0/24
      Metric: 0          IP-External 100.100.100.100/32

    Inter-Area

    XR3 has now been added to the topology. I’ve had to move XR2 into the same area as XR3, otherwise they would not be able to form an L1 adjacency:
    [Image: IS IS 2]

    The XR2-XR3 link has not been changed to point-to-point, and as such I would expect to see three LSPs in XR3’s database:

    RP/0/0/CPU0:XR3#show isis database
    Tue Aug 12 09:44:40.660 UTC
    
    IS-IS 1 (Level-1) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x00000008   0xd230        1107            1/0/0
    XR3.00-00           * 0x00000008   0xf1be        1105            0/0/0
    XR3.07-00             0x00000003   0xfcd3        1105            0/0/0
    
     Total Level-1 LSP count: 3     Local Level-1 LSP count: 1

    If you look at XR2’s L1 LSP in detail you now see the ATT bit set. Also note it’s advertising only its directly connected interfaces:

    RP/0/0/CPU0:XR3#show isis database XR2.00-00 detail
    Tue Aug 12 19:45:51.025 UTC
    
    IS-IS 1 (Level-1) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x00000008   0xd230        1037            1/0/0
      Area Address: 49.0023
      NLPID:        0xcc
      Hostname:     XR2
      IP Address:   2.2.2.2
      Metric: 10         IS XR3.07
      Metric: 10         IP 2.2.2.2/32
      Metric: 10         IP 10.0.12.0/24
      Metric: 10         IP 10.0.23.0/24

    XR2 has set the ATT bit which is the attached bit. An L1/L2 router will set this bit in the LSP inside the L1 area it’s connected to. This is to inform the L1 routers that it is attached to the L2 domain. No actual default route is advertised, but L1 routers can create their own defaults pointing towards the attached routers:

    RP/0/0/CPU0:XR3#sh ip route 0.0.0.0
    Tue Aug 12 19:47:07.839 UTC
    
    Routing entry for 0.0.0.0/0
      Known via "isis 1", distance 115, metric 10, candidate default path, type level-1
      Installed Aug 12 19:43:09.476 for 00:03:58
      Routing Descriptor Blocks
        10.0.23.2, from 2.2.2.2, via GigabitEthernet0/0/0/0.23
          Route metric is 10
      No advertising protos.
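
A toy model of that behaviour in Python: an L1 router looks for LSPs with the ATT bit set and points its default at the nearest such router. The dictionaries and the function name are invented for this sketch. Real implementations work from the full SPF result rather than a simple cost table.

```python
def pick_default_next_hop(att_bits, cost_to):
    """Return the attached (ATT bit set) router with the lowest cost, or None.

    att_bits maps router name -> ATT bit from its L1 LSP.
    cost_to maps router name -> metric from the local router to it.
    """
    attached = [name for name, att in att_bits.items() if att]
    if not attached:
        return None  # no L1/L2 router attached, so no default route
    return min(attached, key=lambda name: cost_to[name])


# Values as seen from XR3 in the output above: XR2 has ATT set and is 10 away.
att_bits = {"XR2": True, "XR3": False}
cost_to = {"XR2": 10, "XR3": 0}
print(pick_default_next_hop(att_bits, cost_to))
```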

    Notice from XR1’s perspective that any routes coming from an L1 area are simply flooded by the L1/L2 router as normal routes:

    RP/0/0/CPU0:XR1#show isis database XR2.00-00 detail
    Tue Aug 12 19:50:08.676 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x0000001b   0x5b3d        778             0/0/0
      Area Address: 49.0023
      NLPID:        0xcc
      Hostname:     XR2
      IP Address:   2.2.2.2
      Metric: 10         IS XR1.00
      Metric: 10         IP 2.2.2.2/32
      Metric: 20         IP 3.3.3.3/32
      Metric: 10         IP 10.0.12.0/24
      Metric: 10         IP 10.0.23.0/24
      Metric: 10         IP 200.200.200.200/32

    IS-IS gives you the ability to leak L2 prefixes into the L1 domain. This is handy when you have two L1/L2 border routers and want to engineer destinations to go over particular paths. From XR2 I’ll leak XR1’s loopback into the L1 domain. The database now shows:

    RP/0/0/CPU0:XR3#show isis database XR2.00-00 detail
    Tue Aug 12 21:53:13.981 UTC
    
    IS-IS 1 (Level-1) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x0000002f   0x4e13        1193            1/0/0
      Area Address: 49.0023
      NLPID:        0xcc
      Hostname:     XR2
      IP Address:   2.2.2.2
      Router Cap:   2.2.2.2, D:0, S:0
      Metric: 10         IS XR3.07
      Metric: 20         IP-Interarea 1.1.1.1/32
      Metric: 10         IP 2.2.2.2/32
      Metric: 10         IP 10.0.23.0/24

    1.1.1.1/32 shows up in the LSP as an IP-Interarea route. Again a TLV is used for this.

    IPv6

    When running both IPv4 and IPv6 at the same time, IS-IS can be run in single-topology or multi-topology mode. In single topology, all your IS-IS links need to have both v4 and v6 addresses as the SPF tree is run independently of prefix information. If the SPF tree is calculated to use a link without a v6 address, IPv6 traffic will be blackholed over that link.

    For now I’ve added an IPv6 loopback and interface address. I’ve got IS-IS running in multi-topology mode. I should still only see two LSPs from XR1’s perspective:

    RP/0/0/CPU0:XR1#show isis database
    Tue Aug 12 23:47:02.152 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.00-00           * 0x0000001e   0x9683        1115            0/0/0
    XR2.00-00             0x0000002b   0x62fa        1117            0/0/0
    
     Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

    IPv6 information is carried inside another TLV. Note also that there is a new NLPID value of 0x8e in the LSP. As you would guess this value represents IPv6:

    RP/0/0/CPU0:XR1#show isis database detail XR2.00-00
    Tue Aug 12 23:47:50.899 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x0000002b   0x62fa        1068            0/0/0
      Area Address: 49.0023
      NLPID:        0xcc
      NLPID:        0x8e
      MT:           Standard (IPv4 Unicast)
      MT:           IPv6 Unicast                                     0/0/0
      Hostname:     XR2
      IP Address:   2.2.2.2
      IPv6 Address: 2001:db8:2:2::2
      Metric: 10         IS XR1.00
      Metric: 10         IP 2.2.2.2/32
      Metric: 20         IP 3.3.3.3/32
      Metric: 10         IP 10.0.12.0/24
      Metric: 10         IP 10.0.23.0/24
      Metric: 10         IP 200.200.200.200/32
      Metric: 10         MT (IPv6 Unicast) IS-Extended XR1.00
      Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:2:2::2/128
      Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:12::/64

    When running in multi-topology mode, you’ll see MT: plus the address families configured for multi-topology. If I change this to single topology:

    RP/0/0/CPU0:XR1#show isis database XR2.00-00 detail
    Tue Aug 12 23:11:20.989 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x00000023   0xd22a        1196            0/0/0
      Area Address: 49.0023
      NLPID:        0xcc
      NLPID:        0x8e
      Hostname:     XR2
      IP Address:   2.2.2.2
      IPv6 Address: 2001:db8:2:2::2
      Metric: 10         IS XR1.00
      Metric: 10         IP 2.2.2.2/32
      Metric: 10         IP 10.0.12.0/24
      Metric: 10         IP 10.0.23.0/24
      Metric: 10         IP 200.200.200.200/32
      Metric: 10         IPv6 2001:db8:2:2::2/128
      Metric: 10         IPv6 2001:db8:12::/64

    MT no longer shows up, and all TLVs are added as-is to the LSP.

    Traffic Engineering

    To enable TE, wide metrics need to be enabled. Up until this point I’ve been using narrow metrics. Once enabled, you can see the TE information in the LSP by doing a verbose output:

    RP/0/0/CPU0:XR1#show isis database verbose XR2.00-00
    Tue Aug 12 23:42:09.932 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x00000026   0x2dd8        910             0/0/0
      Area Address: 49.0023
      NLPID:        0xcc
      NLPID:        0x8e
      Hostname:     XR2
      IP Address:   2.2.2.2
      IPv6 Address: 2001:db8:2:2::2
      Router ID:    2.2.2.2
      Metric: 10         IS-Extended XR1.00
        Affinity: 0x00000000
        Interface IP Address: 10.0.12.2
        Neighbor IP Address: 10.0.12.1
        Physical BW: 1000000 kbits/sec
        Reservable Global pool BW: 0 kbits/sec
        Global Pool BW Unreserved:
          [0]: 0        kbits/sec          [1]: 0        kbits/sec
          [2]: 0        kbits/sec          [3]: 0        kbits/sec
          [4]: 0        kbits/sec          [5]: 0        kbits/sec
          [6]: 0        kbits/sec          [7]: 0        kbits/sec
        Admin. Weight: 167772160
        Ext Admin Group: Length: 32
          0x00000000   0x00000000
          0x00000000   0x00000000
          0x00000000   0x00000000
          0x00000000   0x00000000
      Metric: 10         IP-Extended 2.2.2.2/32
      Metric: 10         IP-Extended 10.0.12.0/24
      Metric: 10         IP-Extended 10.0.23.0/24
      Metric: 10         IP-Extended 200.200.200.200/32
      Metric: 10         IPv6 2001:db8:2:2::2/128
      Metric: 10         IPv6 2001:db8:12::/64

    Notice that there is no new NLPID value for TE. TE extensions are enabled under address-family ipv4 and hence it uses the 0xcc ID. If/when RSVP-TE can use IPv6 natively, I would expect to see only the IPv6 ID.
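
    Going back to the wide-metric requirement at the start of this section: the TE details ride as sub-TLVs under the IS-Extended entry, which only exists with wide metrics. Wide metrics also lift the per-link limit. A narrow link metric is a 6-bit value, while a wide link metric is 24 bits. A rough sketch of that limit check (the function name is my own, not a router API):

```python
# Per-link metric limits for the two IS-IS metric styles.
MAX_LINK_METRIC = {
    "narrow": 2**6 - 1,   # 63
    "wide": 2**24 - 1,    # 16777215
}


def fits_metric_style(metric, style):
    """Return True if a link metric can be encoded in the given style."""
    return 0 <= metric <= MAX_LINK_METRIC[style]


print(fits_metric_style(10, "narrow"))   # True
print(fits_metric_style(100, "narrow"))  # False
print(fits_metric_style(100, "wide"))    # True
```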

    Overload

    IS-IS has the ability to set the overload bit in an LSP. This could be originated by the router itself if it was overwhelmed, but it can also be hard set, for example when doing planned work. If the overload bit is set, other routers will route around the router.

    router isis 1
     set-overload-bit

    Note that the OL bit is now set in the LSP:

    RP/0/0/CPU0:XR1#show isis database
    Tue Aug 12 23:32:58.107 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR1.00-00           * 0x0000001f   0x9484        947             0/0/0
    XR2.00-00             0x0000002e   0x97a4        1151            0/0/1
    
     Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

    I no longer have access to XR3 now, as XR2 is the only router connecting these two devices:

    RP/0/0/CPU0:XR1#ping 3.3.3.3
    Tue Aug 12 23:08:44.083 UTC
    Type escape sequence to abort.
    Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
    UUUUU
    Success rate is 0 percent (0/5)

    I am still able to ping XR2 itself though:

    RP/0/0/CPU0:XR1#ping 2.2.2.2
    Tue Aug 12 23:09:32.870 UTC
    Type escape sequence to abort.
    Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:
    !!!!!
    Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/1 ms
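
    The two ping results can be reproduced with a small SPF sketch in Python. It assumes a simple XR1 - XR2 - XR3 line with a metric of 10 on each link. The rule it applies is that an overloaded router is still a valid destination but is never expanded as a transit hop. This illustrates the idea and is not how IOS-XR implements it:

```python
import heapq

# Assumed line topology: XR1 - XR2 - XR3, all metrics 10.
LINKS = {
    "XR1": {"XR2": 10},
    "XR2": {"XR1": 10, "XR3": 10},
    "XR3": {"XR2": 10},
}


def reachable(source, overloaded=frozenset()):
    """Return {router: cost} from source, never transiting an overloaded router."""
    cost_to = {}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node in cost_to:
            continue
        cost_to[node] = cost
        # An overloaded router can still be reached, but nothing
        # behind it is explored through it.
        if node in overloaded and node != source:
            continue
        for neighbor, metric in LINKS[node].items():
            if neighbor not in cost_to:
                heapq.heappush(queue, (cost + metric, neighbor))
    return cost_to


print(reachable("XR1"))                      # {'XR1': 0, 'XR2': 10, 'XR3': 20}
print(reachable("XR1", overloaded={"XR2"}))  # {'XR1': 0, 'XR2': 10}
```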

    We’ve now seen the purpose of both the ATT and OL bits, so what is the P bit for? That is the Partition Repair bit, which no vendor has implemented, i.e. it should always show 0.

    Segment Routing

    IS-IS is easily extended using new TLVs. If I enable segment routing under my IS-IS process, I see it added as a new TLV in the LSP:

    RP/0/0/CPU0:XR1#show isis database verbose XR2.00-00
    Tue Aug 12 23:50:35.855 UTC
    
    IS-IS 1 (Level-2) Link State Database
    LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
    XR2.00-00             0x00000036   0x252b        954             0/0/0
      Area Address: 49.0023
      NLPID:        0xcc
      NLPID:        0x8e
      MT:           Standard (IPv4 Unicast)
      MT:           IPv6 Unicast                                     0/0/0
      Hostname:     XR2
      IP Address:   2.2.2.2
      IPv6 Address: 2001:db8:2:2::2
      Router Cap:   2.2.2.2, D:0, S:0
        Segment Routing: I:1 V:0, SRGB Base: 900000 Range: 65535
      Metric: 10         IS XR1.00
      Metric: 10         IP 2.2.2.2/32
      Metric: 20         IP 3.3.3.3/32
      Metric: 10         IP 10.0.12.0/24
      Metric: 10         IP 10.0.23.0/24
      Metric: 10         IP 200.200.200.200/32
      Metric: 10         MT (IPv6 Unicast) IS-Extended XR1.00
      Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:2:2::2/128
      Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:12::/64
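
    The SRGB advertised above is what turns a SID index into an MPLS label: the label is simply the base plus the index, as long as the index fits inside the range. A minimal sketch, where the index value 2 is a made-up example and not something configured in this lab:

```python
SRGB_BASE = 900000   # "SRGB Base: 900000" from the LSP
SRGB_RANGE = 65535   # "Range: 65535" from the LSP


def sid_index_to_label(index, base=SRGB_BASE, size=SRGB_RANGE):
    """Map a SID index onto a label inside the advertised SRGB."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside the SRGB range of {size}")
    return base + index


print(sid_index_to_label(2))  # 900002
```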

    The Accumulated IGP Metric Attribute for BGP

    This is an interesting draft which can ensure better paths are chosen in certain corner cases. Before this draft, BGP was able to redistribute the IGP metric as a MED value into BGP. The issue with MED is that it sits very low in the BGP best-path algorithm. Note that Cisco/Brocade consider weight first, but I’ll ignore that for now:

    1. Highest Local-Preference
    2. Shortest AS-Path
    3. Lowest Origin Code
    4. Lowest MED
    5. ETC

    MED is only number 4 in the pecking order. In a large network it might be difficult to get everything to match up to that point. Accumulated IGP Metric is a new non-transitive BGP path attribute that carries the IGP metric along with the prefix in the BGP update. Not only that, but the best-path algorithm is changed as follows:

    1. Highest Local-Preference
    2. Lowest AIGP Cost
    3. Shortest AS-Path
    4. Lowest Origin Code
    5. Lowest MED
    6. ETC

    As long as your local-preference values match, the lowest AIGP cost is taken into account.
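
    The modified ordering can be written down as a sort key. This is only a sketch of the list above in Python: the `Path` class, the path names and the attribute values are invented for illustration, and the real algorithm has many more steps after MED:

```python
from dataclasses import dataclass
from typing import Tuple

# Lower rank wins: IGP < EGP < incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class Path:
    name: str
    local_pref: int
    aigp: int
    as_path: Tuple[int, ...]
    origin: str
    med: int


def preference_key(path):
    """Smaller tuple = more preferred, following the list above."""
    return (
        -path.local_pref,          # 1. highest local-preference
        path.aigp,                 # 2. lowest AIGP cost
        len(path.as_path),         # 3. shortest AS-path
        ORIGIN_RANK[path.origin],  # 4. lowest origin code
        path.med,                  # 5. lowest MED
    )


paths = [
    Path("short-as-path", 100, 120, (64512,), "incomplete", 0),
    Path("low-aigp", 100, 40, (64512, 64512), "incomplete", 0),
]
best = min(paths, key=preference_key)
print(best.name)  # low-aigp
```

    With equal local-preference, the path with the lower AIGP cost wins even though its AS-path is longer, because AIGP is compared first.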

    No AIGP

    Take the following topology into consideration:
    [Figure: BGP AIGP topology]
    Assuming all link costs are the same, the shortest path for XR2 to get to IOS2 is via path XR1-XR4-IOS2. I’m going to ignore MED on XR2 for now.

    Quick relevant configs on XR1 and IOS1:

    interface GigabitEthernet0/0/0/1
     description Link to XR2
     ipv4 address 10.0.12.1 255.255.255.0
    !
    interface GigabitEthernet0/0/0/2
     description Link to XR3
     ipv4 address 10.0.13.1 255.255.255.0
    !
    prefix-set 20.20.20.20
      20.20.20.20/32
    end-set
    !
    route-policy PASS
      pass
    end-policy
    !
    route-policy IOS2_LOOPBACK
      if destination in 20.20.20.20 then
        done
      else
        drop
      endif
    end-policy
    !
    router isis 1
     is-type level-2-only
     net 49.0001.0000.0000.0001.00
     address-family ipv4 unicast
      metric-style wide
     !
     interface Loopback0
      address-family ipv4 unicast
      !
     !
     interface GigabitEthernet0/0/0/2
      address-family ipv4 unicast
      !
     !
    !
    router bgp 64512
     address-family ipv4 unicast
      redistribute isis 1 route-policy IOS2_LOOPBACK
     !
     neighbor 10.0.12.2
      remote-as 64513
      address-family ipv4 unicast
       route-policy PASS in
       route-policy PASS out
      !
     !
    !

    IOS1:

    interface Loopback0
     ip address 10.10.10.10 255.255.255.255
     ip router isis 1
    !
    interface GigabitEthernet0/0
    !
    interface GigabitEthernet0/0.41
     encapsulation dot1Q 41
     ip address 10.0.41.1 255.255.255.0
     ip router isis 1
    !
    router isis 1
     net 49.0001.0000.0000.0010.00
     is-type level-2-only
     metric-style wide
    !
    router bgp 64512
     bgp log-neighbor-changes
     redistribute isis 1 level-2 route-map IOS2_LOOPBACK
     neighbor 10.0.21.2 remote-as 64513
    !
    ip prefix-list 20.20.20.20 seq 5 permit 20.20.20.20/32
    !
    route-map IOS2_LOOPBACK permit 10
     match ip address prefix-list 20.20.20.20
    !
    route-map IOS2_LOOPBACK deny 20

    XR2 should now have the 20.20.20.20/32 prefix twice. Let’s check the route that XR2 chose:

    RP/0/0/CPU0:XR2#show bgp ipv4 un 20.20.20.20
    Mon Aug 11 18:29:47.825 UTC
    BGP routing table entry for 20.20.20.20/32
    Versions:
      Process           bRIB/RIB  SendTblVer
      Speaker                  5           5
    Last Modified: Aug 11 18:24:57.101 for 00:04:50
    Paths: (2 available, best #1)
      Advertised to update-groups (with more than one peer):
        0.1
      Path #1: Received by speaker 0
      Advertised to update-groups (with more than one peer):
        0.1
      64512
        10.0.12.1 from 10.0.12.1 (1.1.1.1)
          Origin incomplete, metric 0, localpref 100, valid, external, best, group-best, import-candidate, import suspect
          Received Path ID 0, Local Path ID 1, version 5
          Origin-AS validity: not-found
      Path #2: Received by speaker 0
      Not advertised to any peer
      64512
        10.0.21.1 from 10.0.21.1 (10.10.10.10)
          Origin incomplete, metric 0, localpref 100, valid, external, import-candidate, import suspect
          Received Path ID 0, Local Path ID 0, version 0
          Origin-AS validity: not-found

    Currently it’s going the correct way. However, what happens if the metric of XR1’s route to 20.20.20.20/32 is increased?

    router isis 1
    !
     interface GigabitEthernet0/0/0/2
      address-family ipv4 unicast
       metric 100

    XR2 still sees the best route via XR1:

    RP/0/0/CPU0:XR2#show route ipv4 20.20.20.20
    Mon Aug 11 18:31:34.958 UTC
    
    Routing entry for 20.20.20.20/32
      Known via "bgp 64513", distance 20, metric 0
      Tag 64512, type external
      Installed Aug 11 18:24:57.065 for 00:06:37
      Routing Descriptor Blocks
        10.0.12.1, from 10.0.12.1, BGP external
          Route metric is 0
      No advertising protos.

    AIGP

    In order to send AIGP, you need to ensure that the AIGP metric is being set in your route-policy, as well as turn on the feature under the neighbour address family. I’ll be doing this on XR1:

    route-policy IOS2_LOOPBACK
      if destination in 20.20.20.20 then
        set aigp-metric igp-cost
      else
        drop
      endif
    end-policy
    !
    router bgp 64512
     !
     neighbor 10.0.12.2
      address-family ipv4 unicast
       aigp

    AIGP has only just been added to legacy IOS in version 15.4(3)T, which is a version I don’t have in my lab yet. Let’s take a look at the consequences of one router setting this value and the other not.

    RP/0/0/CPU0:XR2#show route ipv4 20.20.20.20
    Mon Aug 11 18:42:42.952 UTC
    
    Routing entry for 20.20.20.20/32
      Known via "bgp 64513", distance 20, metric 120 (AIGP metric)
      Tag 64512, type external
      Installed Aug 11 18:35:09.393 for 00:07:33
      Routing Descriptor Blocks
        10.0.12.1, from 10.0.12.1, BGP external
          Route metric is 120

    IOS-XR is preferring the route with the AIGP metric set. You can see the metric value of 120 has been learned. It also sets the local route metric to 120. The update from IOS1 is not preferred, so it seems like a path without an AIGP value is seen as worse than any AIGP value that may be set.
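
    One way to model what XR2 appears to be doing is a filter step: if any path carries AIGP, paths without it drop out, and the lowest AIGP value wins among the rest. The sketch below is my reading of the observed behaviour and not IOS-XR’s actual code. The next-hops and the value of 120 come from the outputs above:

```python
from typing import NamedTuple, Optional


class Path(NamedTuple):
    next_hop: str
    aigp: Optional[int]  # None means the update carried no AIGP attribute


def aigp_step(paths):
    """Return the paths that survive the AIGP comparison."""
    with_aigp = [p for p in paths if p.aigp is not None]
    if not with_aigp:
        return list(paths)  # nobody carries AIGP: the step changes nothing
    lowest = min(p.aigp for p in with_aigp)
    return [p for p in with_aigp if p.aigp == lowest]


paths = [
    Path("10.0.12.1", 120),   # via XR1, AIGP set by the route-policy
    Path("10.0.21.1", None),  # via IOS1, no AIGP support
]
print([p.next_hop for p in aigp_step(paths)])  # ['10.0.12.1']
```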

    I’m going to swap out IOS1 with another IOS-XR box. This new XR box will be advertising the route with the same metric as IOS1 currently is.
    [Figure: updated BGP AIGP topology]

    XR2 should now be seeing both AIGP values and choosing XR5 as the next-hop:

    RP/0/0/CPU0:XR2#show bgp ipv4 unicast 20.20.20.20/32
    Mon Aug 11 19:33:43.432 UTC
    BGP routing table entry for 20.20.20.20/32
    Versions:
      Process           bRIB/RIB  SendTblVer
      Speaker                  9           9
    Last Modified: Aug 11 19:33:33.101 for 00:00:10
    Paths: (2 available, best #2)
      Advertised to update-groups (with more than one peer):
        0.1
      Path #1: Received by speaker 0
      Not advertised to any peer
      64512
        10.0.12.1 from 10.0.12.1 (1.1.1.1)
          Origin incomplete, metric 0, localpref 100, aigp metric 120, valid, external, import suspect
          Received Path ID 0, Local Path ID 0, version 0
          Origin-AS validity: not-found
          Total AIGP metric 120
      Path #2: Received by speaker 0
      Advertised to update-groups (with more than one peer):
        0.1
      64512
        10.0.52.5 from 10.0.52.5 (5.5.5.5)
          Origin incomplete, metric 0, localpref 100, aigp metric 40, valid, external, best, group-best, import-candidate, import suspect
          Received Path ID 0, Local Path ID 1, version 9
          Origin-AS validity: not-found
          Total AIGP metric 40

    Once again, the local route metric has been set to match the AIGP metric:

    RP/0/0/CPU0:XR2#sh ip route 20.20.20.20
    Mon Aug 11 19:34:54.567 UTC
    
    Routing entry for 20.20.20.20/32
      Known via "bgp 64513", distance 20, metric 40 (AIGP metric)
      Tag 64512, type external
      Installed Aug 11 19:33:33.523 for 00:01:21
      Routing Descriptor Blocks
        10.0.52.5, from 10.0.52.5, BGP external
          Route metric is 40

    Various networking ramblings from Dual CCIE #38070 (R&S, SP) and JNCIE-SP #2227

    © 2009-2014 Darren O'Connor All Rights Reserved