Debian/Ubuntu PMTUD & uRPF

I originally started my PMTUD posts using Ubuntu 14.04. Halfway through the post I simply could not get Ubuntu to change its MTU on receipt of ICMP fragmentation needed messages. I then tried Debian and it worked. Windows also had no issues changing its MTU.

Wanting to finish off the post, I switched to Debian and planned to investigate the fault later.

Let’s remind ourselves of the original topology:
[Figure: the original PMTUD topology]
Swap out Debian for Ubuntu in the above image.

When I initially started to test, I dropped the MTU between R1 and R2 to 1400. The link between R2 and R4 was kept at 1500. If the user requested a file from the server at this point, Ubuntu would attempt to send at 1500 and get its packet dropped at R1. R1 would send a Fragmentation Needed packet back to the Ubuntu server, which would adjust its MTU and then send at 1400.

When I changed the MTU between R1 and R2 back up to 1500 and dropped R2-R4 down to 1400, it no longer worked. Debian and Windows did work. I ran tcpdump on Ubuntu and confirmed that it was definitely getting Fragmentation Needed packets. Ubuntu was only acting on Fragmentation Needed packets if they came from its default gateway, R1. Any router further along in the path was getting its ICMP packets ignored.

In order to understand what the problem is I need to show more about the topology. While the above diagram shows how things are connected for the most part, it is missing a couple of things. All the devices are running inside VirtualBox linked to GNS3. eth1 on each server is connected to the above topology, while eth0 is connected via NAT to my host PC so I could install software:
[Figure: topology showing the eth0 and eth1 connections]
Each device had a static route to 192.168.0.0/16 to go out eth1 while their default route was out eth0. Some of you may be sensing what the issue is already…

The point to point links between the virtual routers are using the 10.0.0.0/8 space.

If Ubuntu received an ICMP packet from 192.168.4.1, its local default gateway on R1, there were no issues. If it received a packet from R2 or R4’s local interfaces, the packet was dropped. Debian and Windows both didn’t have problems, even though they are configured the same way.

sysctl.conf

I’ve touched on sysctl.conf before the PMTUD posts, but there is an important difference in the defaults of Ubuntu and Debian. Take a look at this.
Debian:
[Screenshot: Debian defaults]
Ubuntu:
[Screenshot: Ubuntu defaults]

uRPF

Ubuntu has Unicast Reverse Path Forwarding on by default. Debian has it off by default. In sysctl.conf on both machines, the required configuration setting is commented out:
[Screenshot: the commented-out uRPF lines in sysctl.conf]
R2 was originating its ICMP packets from its local interface, 10.0.12.2 in my example. Ubuntu did receive that packet, but it failed the RPF check and so was ignored. To confirm I tested this in two different ways:

  • Add a static route to 10.0.0.0/8 out eth1
  • Disable uRPF check on Ubuntu

Each test individually allowed the original PMTUD to work. What’s odd is that the sysctl.conf file in Ubuntu says that you need to uncomment the lines to turn on uRPF, but it’s on by default. Uncommenting the lines and setting the value to 1 is the same as leaving them commented. In Debian the default is to disable uRPF. In that distro you would need to uncomment the uRPF lines and set the value to 1 to turn the feature on.
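
The check that dropped R2's packets can be sketched in a few lines of Python. This is only a rough model of a strict reverse path check, not the kernel's actual code: the route table is reduced to the two routes described above, and the function names are made up for the illustration.

```python
import ipaddress


def best_route(routes, address):
    """Longest-prefix match: return the interface of the most specific route."""
    addr = ipaddress.ip_address(address)
    matches = [(net, dev) for net, dev in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def strict_rpf_accepts(routes, src, in_iface):
    """Strict uRPF: accept only if the route back to src uses the arrival interface."""
    return best_route(routes, src) == in_iface


# Simplified version of the server's routing table from this post.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),       # default route via NAT
    (ipaddress.ip_network("192.168.0.0/16"), "eth1"),  # static route into the lab
]

# ICMP from the default gateway on R1 arrives on eth1 and passes.
print(strict_rpf_accepts(ROUTES, "192.168.4.1", "eth1"))  # True

# ICMP from R2's 10.0.12.2 arrives on eth1, but the route back is via eth0.
print(strict_rpf_accepts(ROUTES, "10.0.12.2", "eth1"))    # False

# Adding the static route to 10.0.0.0/8 out eth1 makes the check pass.
FIXED = ROUTES + [(ipaddress.ip_network("10.0.0.0/8"), "eth1")]
print(strict_rpf_accepts(FIXED, "10.0.12.2", "eth1"))     # True
```

The last two calls correspond to the two tests above: either the route table is changed so the source is reachable via eth1, or the check itself is switched off.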

Conclusions

  • If a server is multi-homed, PMTUD could break if the ICMP message arrives on an interface that the server is not expecting.
  • If you do have a multi-homed server, it would probably be best to turn off uRPF.

Fundamentals – PMTUD – IPv4 vs IPv6 – Part 2 of 2

This is a continuation of a post I started back here. Please read it first before starting below.

RFC 4821

Another workaround we can use is Packetization Layer Path MTU Discovery – RFC 4821. The RFC allows a host to act in one of two main ways:

  • Use regular PMTUD. If no acknowledgments are received and no ICMP messages are received, start to probe.
  • Ignore regular PMTUD and always probe.

Probing is where the host sends a packet at the minimum configured MTU and then attempts to increase that size. If acknowledgements are received at the larger size, it tries to increase it again. Option 1 waits for a timeout, so on broken PMTUD paths it starts a bit slowly. It will however use regular PMTUD whenever it can, so it’s a lot more efficient. Option 2 simply probes all the time. It starts a bit quicker on smaller MTU paths, but the server is also sending smaller packets on ALL paths in the beginning, which is much less efficient.

I’ll configure this in Debian and then go through Wireshark to show what’s going on. Add the line net.ipv4.tcp_mtu_probing = 1 to /etc/sysctl.conf then reload sysctl:

root@debian1:~# sysctl -p
net.ipv4.tcp_mtu_probing = 1

Start the transfer and what does Wireshark show us:
[Screenshot: Wireshark capture of the start of the transfer]
After the standard 3-way handshake, the server sends a number of 1514 byte packets. ICMP has been blocked and as such there are no ICMP fragmentation needed messages coming from R2. After 5.3 seconds the server sends a number of 578 byte packets.
[Screenshot: Wireshark capture of the 578 byte packets]
These get ACK’d correctly:
[Screenshot: Wireshark capture of the ACKs]
0.5 seconds later the server sends a single 1090 byte packet and fills the rest of the window with 578 byte packets. As soon as the ACK for that big packet comes back, the server sends all of its packets at 1090:
[Screenshot: Wireshark capture of the 1090 byte probe]
[Screenshot: Wireshark capture of the packets sent at 1090 bytes]
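
The frame sizes in the capture line up with round TCP payload sizes once the headers are subtracted. The following is a small arithmetic sketch, assuming IPv4 without options, an untagged Ethernet header and the TCP timestamp option in use (12 bytes including padding). Those header sizes are assumptions for the illustration, not something read out of the capture.

```python
# Assumed per-packet overhead for the frames seen in Wireshark.
ETHERNET_HEADER = 14       # untagged Ethernet II, FCS not captured
IPV4_HEADER = 20           # no IP options
TCP_HEADER = 20            # base TCP header
TCP_TIMESTAMP_OPTION = 12  # assumption: timestamps enabled (10 bytes + 2 padding)

OVERHEAD = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER + TCP_TIMESTAMP_OPTION


def payload_size(frame_size):
    """TCP payload carried in a frame of the given size."""
    return frame_size - OVERHEAD


for frame in (1514, 578, 1090):
    print(frame, "byte frame carries", payload_size(frame), "bytes of TCP payload")
```

Under those assumptions the 578 byte packets carry 512 bytes of payload and the 1090 byte probe carries 1024, so the probe is double the starting size. The value 512 would match the net.ipv4.tcp_base_mss default on kernels of that era, but treat that link as an interpretation rather than something confirmed here.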

A couple of things to note about this setting in Ubuntu 14.04 and Debian 7.6.0:

  1. The system does not cache the MTU of the path found through PLPMTUD. This does mean that if you have a host making multiple TCP connections to your server over a small MTU path, each one of those is going to need to wait for the timeout.
  2. There is no net.ipv6.tcp_mtu_probing setting in sysctl.conf. However if you enable this setting for IPv4 then IPv6 has the same behavior as IPv4:

[Screenshot: the same probing behavior over IPv6]

Windows can also be configured for PLPMTUD but I’ll leave it up to the reader to figure out how to do that.

PMTUD Cache

I showed in part 1 that the server will cache an entry if the MTU is lower than the local link. By default, Debian will cache this entry for 10 minutes. This time is adjustable via sysctl.conf:

root@debian1:~# sysctl -a | grep mtu_expires
net.ipv4.route.mtu_expires = 600
net.ipv6.route.mtu_expires = 600

As soon as a value is cached, the timer starts. This timer counts down even if there is an existing file transfer. The reason is that paths can change. While the transfer is going on it could move to a path which has no MTU issues. We would want the server to then increase its MTU. Doing this too quickly can cause more traffic to drop and so the suggestion is to cache the MTU for 10 whole minutes and then try to increase. I’ve started a file transfer which is ongoing and then checked the cache entry on the server. You can see the timer going down:
[Screenshot: cache entry with the timer counting down]
The client has then finished downloading and disconnected from the server. At this point the server still keeps that cache entry. This ensures that if the client connects again shortly it will start with an MTU of 1400:
[Screenshot: cache entry still present after the client disconnected]
I’ve started a new download within the cache time above and we can see the server immediately starts sending packets with the correct MTU:
[Screenshot: new download starting at the cached MTU]
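
The expected life cycle of a cache entry can be sketched as below. This is a toy model of the behavior described in this section, not the kernel's implementation: the class name, the method names and the example destination address are all invented for the illustration, and time is passed in explicitly to keep it easy to follow.

```python
class PmtuCache:
    """Toy model of a per-destination path MTU cache with an expiry timer."""

    def __init__(self, link_mtu=1500, expires=600):
        self.link_mtu = link_mtu  # MTU of the outgoing interface
        self.expires = expires    # lifetime in seconds, like mtu_expires
        self._entries = {}        # destination -> (mtu, expiry time)

    def learn(self, dst, mtu, now):
        """Cache a path MTU, but only if it is lower than the local link MTU."""
        if mtu < self.link_mtu:
            self._entries[dst] = (mtu, now + self.expires)

    def lookup(self, dst, now):
        """Return the MTU to use towards dst at time now."""
        entry = self._entries.get(dst)
        if entry is None:
            return self.link_mtu
        mtu, expiry = entry
        if now >= expiry:
            # Entry timed out: forget it and try the full link MTU again.
            del self._entries[dst]
            return self.link_mtu
        return mtu

    def flush(self):
        """Drop every entry, similar in spirit to flushing the route cache."""
        self._entries.clear()


cache = PmtuCache()
cache.learn("198.51.100.10", 1400, now=0)    # hypothetical client address
print(cache.lookup("198.51.100.10", now=300))  # within 10 minutes: 1400
print(cache.lookup("198.51.100.10", now=600))  # expired: back to 1500
```

The timer runs from the moment the entry is learned, whether or not a transfer is in progress, which is the behavior the countdown above shows.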

What should happen when the cache times out is that the server should try to send a larger MTU packet, up to the local MTU. I don’t see that with Debian though. I started the test with the lower MTU cached on my server. When the cache was about to expire above I started the test again and as expected the session starts with the lower cached MTU. I then changed the MTU between R2 and R5 back up to the regular MTU:

R2(config)#int fa0/1
R2(config-if)#no ip mtu 1400
R2(config-if)#end

The odd thing is, when the cache entry timed out, Debian carried on sending packets with an MTU of 1400 and cached the entry again. That’s not supposed to happen.

I then tried the same test again, this time manually clearing the cache on Debian:

root@debian1:~# ip route flush cache

This time the server immediately started to send larger packets:
[Screenshot: server sending larger packets after the cache flush]

IPv6 has roughly the same broken behavior. At first the cache entry is created and starts to count down. I started a transfer when it was about to expire. This time it again stayed at 1400, but the timer jumped to a huge number:
[Screenshot: IPv6 cache entry with the large timer value]
8590471 seconds is roughly 99 days. Not sure if this is a bug or what exactly.

Clearing the IPv6 cache on the other hand had the required effect:
[Screenshot: IPv6 cache after clearing]
If the MTU matches the outgoing interface, there is no need for the system to cache that entry, taking up more resources on the server. Wireshark shows the jump in MTU:
[Screenshot: Wireshark capture showing the jump in MTU]

Conclusions

  • Blocking the required ICMP packets breaks PMTUD completely.
  • There are alternatives to PMTUD, but they are slower initially.
  • Test your OS’s behavior. I mainly tested with Debian and I ran into a number of ‘odd’ scenarios. Mainly to do with the cache.

Demystifying the IS-IS database

I’ve gone over the OSPFv2 and OSPFv3 databases in depth before. Now is the time for IS-IS. As always, I’ll start from a basic two router set up and add devices to the topology.

Basic LSPs

In OSPF we use the term LSA, Link-State Advertisement. In IS-IS we use the term LSP, Link-State PDU, which expands further into Link-State Protocol Data Unit. Not to be confused with Label Switched Paths.

This is the topology we’ll start with:
[Figure: IS-IS topology with XR1 and XR2]
Like OSPF, IS-IS will treat Ethernet links as broadcast by default. In OSPF a DR and BDR will be elected. In IS-IS a single DIS (Designated Intermediate System) is elected with no backup DIS. This DIS election is also pre-emptive, unlike OSPF. The DIS will originate an additional LSP representing the broadcast segment itself (the pseudonode). This means I should have three LSPs in the database currently:

RP/0/0/CPU0:XR1#show isis database
Tue Aug 12 17:34:21.594 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.00-00           * 0x00000003   0x8577        736             0/0/0
XR1.01-00             0x00000002   0x1fba        931             0/0/0
XR2.00-00             0x00000005   0x856b        806             0/0/0

 Total Level-2 LSP count: 3     Local Level-2 LSP count: 1

XR2 has a single LSP while XR1 has two. The XR1.01 LSP is the DIS LSP. Dig deeper into the LSPs to see their current content:

RP/0/0/CPU0:XR1#show isis database XR1.00-00 detail
Tue Aug 12 17:38:23.307 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.00-00           * 0x00000003   0x8577        494             0/0/0
  Area Address: 49.0001
  NLPID:        0xcc
  Hostname:     XR1
  IP Address:   1.1.1.1
  Metric: 10         IS XR1.01
  Metric: 10         IP 1.1.1.1/32
  Metric: 10         IP 10.0.12.0/24

XR1 has originated an LSP stating which area it’s in and its hostname. Notice the NLPID value. This stands for Network Layer Protocol IDentifier. The value of 0xcc translates to IPv4. Further down, the LSP contains an IS entry for XR1.01, the pseudonode of the segment XR1 is attached to, plus two IP prefixes, each with its metric. I’ll get onto the ATT/P/OL bits later so ignore those for now.

It’s important to note that an LSP is made up of several TLVs. On the wire multiple TLVs can be grouped together in a single frame. If large enough, IS-IS will fragment these frames.

As XR1 is the DIS, there is a separate DIS LSP, let’s take a look at that:

RP/0/0/CPU0:XR1#show isis database XR1.01-00 detail
Tue Aug 12 17:43:00.448 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.01-00             0x00000003   0x1dbb        1161            0/0/0
  Metric: 0          IS XR1.00
  Metric: 0          IS XR2.00

The DIS LSP advertises all the ISs that are on the segment in which the DIS sits.
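
The LSP IDs in these outputs follow the pattern system.pseudonode-fragment, which is how XR1.00-00 (the router's own LSP) and XR1.01-00 (the DIS LSP) can be told apart. Here is a minimal sketch of splitting one up, assuming the ID is written the way the show output prints it, with a hostname in place of the system ID and two hex digits for each of the other fields. The type and function names are made up for the example.

```python
from typing import NamedTuple


class LspId(NamedTuple):
    system: str      # hostname (or system ID) of the originating router
    pseudonode: int  # 0 for the router itself, non-zero for a pseudonode
    fragment: int    # LSP fragment number

    @property
    def is_pseudonode(self):
        """True when the LSP represents a broadcast segment rather than a router."""
        return self.pseudonode != 0


def parse_lsp_id(text):
    """Split an LSP ID such as 'XR1.01-00' into its three parts."""
    head, fragment = text.rsplit("-", 1)
    system, pseudonode = head.rsplit(".", 1)
    return LspId(system, int(pseudonode, 16), int(fragment, 16))


for lsp in ("XR1.00-00", "XR1.01-00", "XR2.00-00"):
    parsed = parse_lsp_id(lsp)
    kind = "pseudonode (DIS) LSP" if parsed.is_pseudonode else "router LSP"
    print(lsp, "->", kind)
```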

If I change the segment to point-to-point, this removes the need for a DIS and as such there will be no DIS LSP:

router isis 1
!
 interface GigabitEthernet0/0/0/1
  point-to-point

RP/0/0/CPU0:XR1#show isis database
Tue Aug 12 18:46:50.566 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.00-00           * 0x0000000b   0x7480        674             0/0/0
XR2.00-00             0x0000000d   0x5297        543             0/0/0

 Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

Externals

I’m going to add another loopback interface on XR1 and redistribute that loopback into IS-IS. This will make the route external:

interface Loopback100
 ipv4 address 100.100.100.100 255.255.255.255
!
prefix-set LOOPBACK100
  100.100.100.100/32
end-set
!
route-policy RP-100
  if destination in LOOPBACK100 then
    done
  else
    drop
  endif
end-policy
!
router isis 1
 address-family ipv4 unicast
  redistribute connected level-2 route-policy RP-100

As I mentioned above, IS-IS has separate TLVs that make up the LSP. Therefore there is still only a single LSP from XR1:

RP/0/0/CPU0:XR2#sh isis database
Tue Aug 12 19:03:31.569 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.00-00             0x0000000d   0x6be5        1043            0/0/0
XR2.00-00           * 0x00000010   0x9c8f        1094            0/0/0

 Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

The external route can be seen in the detailed output under that LSP:

RP/0/0/CPU0:XR2#sh isis database XR1.00-00 detail
Tue Aug 12 19:03:58.637 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.00-00             0x0000000d   0x6be5        1016            0/0/0
  Area Address: 49.0001
  NLPID:        0xcc
  Hostname:     XR1
  IP Address:   1.1.1.1
  Metric: 10         IS XR2.00
  Metric: 10         IP 1.1.1.1/32
  Metric: 10         IP 10.0.12.0/24
  Metric: 0          IP-External 100.100.100.100/32

Inter-Area

XR3 has now been added to the topology. I’ve had to move XR2 into the same area as XR3 otherwise they will not be able to form an L1 adjacency:
[Figure: IS-IS topology with XR3 added]

The XR2-XR3 link has not been changed to point-to-point, and as such I would expect to see three LSPs in XR3’s database:

RP/0/0/CPU0:XR3#show isis database
Tue Aug 12 09:44:40.660 UTC

IS-IS 1 (Level-1) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x00000008   0xd230        1107            1/0/0
XR3.00-00           * 0x00000008   0xf1be        1105            0/0/0
XR3.07-00             0x00000003   0xfcd3        1105            0/0/0

 Total Level-1 LSP count: 3     Local Level-1 LSP count: 1

If you look at XR2’s L1 LSP in detail you now see the ATT bit set. Also note it’s advertising only its directly connected interfaces:

RP/0/0/CPU0:XR3#show isis database XR2.00-00 detail
Tue Aug 12 19:45:51.025 UTC

IS-IS 1 (Level-1) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x00000008   0xd230        1037            1/0/0
  Area Address: 49.0023
  NLPID:        0xcc
  Hostname:     XR2
  IP Address:   2.2.2.2
  Metric: 10         IS XR3.07
  Metric: 10         IP 2.2.2.2/32
  Metric: 10         IP 10.0.12.0/24
  Metric: 10         IP 10.0.23.0/24

XR2 has set the ATT bit which is the attached bit. An L1/L2 router will set this bit in the LSP inside the L1 area it’s connected to. This is to inform the L1 routers that it is attached to the L2 domain. No actual default route is advertised, but L1 routers can create their own defaults pointing towards the attached routers:

RP/0/0/CPU0:XR3#sh ip route 0.0.0.0
Tue Aug 12 19:47:07.839 UTC

Routing entry for 0.0.0.0/0
  Known via "isis 1", distance 115, metric 10, candidate default path, type level-1
  Installed Aug 12 19:43:09.476 for 00:03:58
  Routing Descriptor Blocks
    10.0.23.2, from 2.2.2.2, via GigabitEthernet0/0/0/0.23
      Route metric is 10
  No advertising protos.

Notice from XR1’s perspective that any routes coming from an L1 area are simply flooded by the L1/L2 router as normal routes:

RP/0/0/CPU0:XR1#show isis database XR2.00-00 detail
Tue Aug 12 19:50:08.676 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x0000001b   0x5b3d        778             0/0/0
  Area Address: 49.0023
  NLPID:        0xcc
  Hostname:     XR2
  IP Address:   2.2.2.2
  Metric: 10         IS XR1.00
  Metric: 10         IP 2.2.2.2/32
  Metric: 20         IP 3.3.3.3/32
  Metric: 10         IP 10.0.12.0/24
  Metric: 10         IP 10.0.23.0/24
  Metric: 10         IP 200.200.200.200/32

IS-IS gives you the ability to leak L2 prefixes into the L1 domain. This is handy when you have two L1/L2 border routers and want to engineer destinations to go on particular paths. From XR2 I’ll leak XR1’s loopback into the L1 domain. The database now shows:

RP/0/0/CPU0:XR3#show isis database XR2.00-00 detail
Tue Aug 12 21:53:13.981 UTC

IS-IS 1 (Level-1) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x0000002f   0x4e13        1193            1/0/0
  Area Address: 49.0023
  NLPID:        0xcc
  Hostname:     XR2
  IP Address:   2.2.2.2
  Router Cap:   2.2.2.2, D:0, S:0
  Metric: 10         IS XR3.07
  Metric: 20         IP-Interarea 1.1.1.1/32
  Metric: 10         IP 2.2.2.2/32
  Metric: 10         IP 10.0.23.0/24

1.1.1.1/32 shows up in the LSP as an IP-Interarea route. Again a TLV is used for this.

IPv6

When running both IPv4 and IPv6 at the same time, IS-IS can be run in single-topology or multi-topology mode. In single-topology mode, all your IS-IS links need to have both v4 and v6 addresses as the SPF tree is calculated independently of prefix information. If the SPF tree is calculated to use a link without a v6 address, IPv6 traffic will be blackholed over that link.

For now I’ve added an IPv6 loopback and interface address. I’ve got IS-IS running in multi-topology mode. I should still only see two LSPs from XR1’s perspective:

RP/0/0/CPU0:XR1#show isis database
Tue Aug 12 23:47:02.152 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.00-00           * 0x0000001e   0x9683        1115            0/0/0
XR2.00-00             0x0000002b   0x62fa        1117            0/0/0

 Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

IPv6 information is carried inside another TLV. Note also that there is a new NLPID value of 0x8e in the LSP. As you would guess this value represents IPv6:

RP/0/0/CPU0:XR1#show isis database detail XR2.00-00
Tue Aug 12 23:47:50.899 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x0000002b   0x62fa        1068            0/0/0
  Area Address: 49.0023
  NLPID:        0xcc
  NLPID:        0x8e
  MT:           Standard (IPv4 Unicast)
  MT:           IPv6 Unicast                                     0/0/0
  Hostname:     XR2
  IP Address:   2.2.2.2
  IPv6 Address: 2001:db8:2:2::2
  Metric: 10         IS XR1.00
  Metric: 10         IP 2.2.2.2/32
  Metric: 20         IP 3.3.3.3/32
  Metric: 10         IP 10.0.12.0/24
  Metric: 10         IP 10.0.23.0/24
  Metric: 10         IP 200.200.200.200/32
  Metric: 10         MT (IPv6 Unicast) IS-Extended XR1.00
  Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:2:2::2/128
  Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:12::/64

When running multi-topology mode, you’ll see MT: plus the address families configured for multi-topology. If I change this to single topology:

RP/0/0/CPU0:XR1#show isis database XR2.00-00 detail
Tue Aug 12 23:11:20.989 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x00000023   0xd22a        1196            0/0/0
  Area Address: 49.0023
  NLPID:        0xcc
  NLPID:        0x8e
  Hostname:     XR2
  IP Address:   2.2.2.2
  IPv6 Address: 2001:db8:2:2::2
  Metric: 10         IS XR1.00
  Metric: 10         IP 2.2.2.2/32
  Metric: 10         IP 10.0.12.0/24
  Metric: 10         IP 10.0.23.0/24
  Metric: 10         IP 200.200.200.200/32
  Metric: 10         IPv6 2001:db8:2:2::2/128
  Metric: 10         IPv6 2001:db8:12::/64

MT no longer shows up, and all TLVs are added as-is to the LSP.

Traffic Engineering

To enable TE, wide metrics need to be enabled. Up until this point I’ve been using narrow metrics. Once enabled, you can see the TE information in the LSP by doing a verbose output:

RP/0/0/CPU0:XR1#show isis database verbose XR2.00-00
Tue Aug 12 23:42:09.932 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x00000026   0x2dd8        910             0/0/0
  Area Address: 49.0023
  NLPID:        0xcc
  NLPID:        0x8e
  Hostname:     XR2
  IP Address:   2.2.2.2
  IPv6 Address: 2001:db8:2:2::2
  Router ID:    2.2.2.2
  Metric: 10         IS-Extended XR1.00
    Affinity: 0x00000000
    Interface IP Address: 10.0.12.2
    Neighbor IP Address: 10.0.12.1
    Physical BW: 1000000 kbits/sec
    Reservable Global pool BW: 0 kbits/sec
    Global Pool BW Unreserved:
      [0]: 0        kbits/sec          [1]: 0        kbits/sec
      [2]: 0        kbits/sec          [3]: 0        kbits/sec
      [4]: 0        kbits/sec          [5]: 0        kbits/sec
      [6]: 0        kbits/sec          [7]: 0        kbits/sec
    Admin. Weight: 167772160
    Ext Admin Group: Length: 32
      0x00000000   0x00000000
      0x00000000   0x00000000
      0x00000000   0x00000000
      0x00000000   0x00000000
  Metric: 10         IP-Extended 2.2.2.2/32
  Metric: 10         IP-Extended 10.0.12.0/24
  Metric: 10         IP-Extended 10.0.23.0/24
  Metric: 10         IP-Extended 200.200.200.200/32
  Metric: 10         IPv6 2001:db8:2:2::2/128
  Metric: 10         IPv6 2001:db8:12::/64

Notice that there is no new NLPID value for TE. TE extensions are enabled under address-family ipv4 and hence it uses the 0xcc ID. If/when RSVP-TE can use IPv6 natively, I would expect to see only the IPv6 ID.

Overload

IS-IS has the ability to set the overload bit in an LSP. This could be originated by the router itself if it was overwhelmed, but it can also be hard set when doing planned works for example. If the overload bit is set, other routers will route around the router.

router isis 1
 set-overload-bit

Note the OL bit set in XR2’s LSP:

RP/0/0/CPU0:XR1#show isis database
Tue Aug 12 23:32:58.107 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR1.00-00           * 0x0000001f   0x9484        947             0/0/0
XR2.00-00             0x0000002e   0x97a4        1151            0/0/1

 Total Level-2 LSP count: 2     Local Level-2 LSP count: 1

I no longer have access to XR3 now, as XR2 is the only router connecting these two devices:

RP/0/0/CPU0:XR1#ping 3.3.3.3
Tue Aug 12 23:08:44.083 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
UUUUU
Success rate is 0 percent (0/5)

I am still able to ping XR2 itself though:

RP/0/0/CPU0:XR1#ping 2.2.2.2
Tue Aug 12 23:09:32.870 UTC
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/1 ms

We’ve now seen the purpose of both the ATT and OL bits, so what is the P bit for? That is the Partition Repair bit, which no vendor has implemented, i.e. it should always show 0.
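
On the wire these three flags share a single byte in the LSP header with the IS type. The sketch below decodes that byte, assuming the ISO 10589 layout (P in the top bit, four ATT bits, then the overload bit, then two IS-type bits). Collapsing the four ATT bits into a single 0 or 1, the way the ATT/P/OL column appears to, is an assumption about how the show output is rendered, and the function names are invented.

```python
def decode_type_block(value):
    """Split the LSP header flags byte into its fields (ISO 10589 layout assumed)."""
    return {
        "P": (value >> 7) & 0x1,    # partition repair
        "ATT": (value >> 3) & 0xF,  # attached bits: error/expense/delay/default
        "OL": (value >> 2) & 0x1,   # LSP database overload
        "IS type": value & 0x3,     # 1 = level 1, 3 = level 2
    }


def show_flags(value):
    """Render the byte in the ATT/P/OL style used by the show output."""
    fields = decode_type_block(value)
    att = 1 if fields["ATT"] else 0  # assumption: any ATT bit displays as 1
    return f"{att}/{fields['P']}/{fields['OL']}"


print(show_flags(0x03))  # plain L2 router:          0/0/0
print(show_flags(0x0B))  # default-metric ATT bit:   1/0/0
print(show_flags(0x07))  # overload bit set:         0/0/1
```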

Segment Routing

IS-IS is easily extended using new TLVs. If I enable segment routing under my IS-IS process, I see it added as a new TLV in the LSP:

RP/0/0/CPU0:XR1#show isis database verbose XR2.00-00
Tue Aug 12 23:50:35.855 UTC

IS-IS 1 (Level-2) Link State Database
LSPID                 LSP Seq Num  LSP Checksum  LSP Holdtime  ATT/P/OL
XR2.00-00             0x00000036   0x252b        954             0/0/0
  Area Address: 49.0023
  NLPID:        0xcc
  NLPID:        0x8e
  MT:           Standard (IPv4 Unicast)
  MT:           IPv6 Unicast                                     0/0/0
  Hostname:     XR2
  IP Address:   2.2.2.2
  IPv6 Address: 2001:db8:2:2::2
  Router Cap:   2.2.2.2, D:0, S:0
    Segment Routing: I:1 V:0, SRGB Base: 900000 Range: 65535
  Metric: 10         IS XR1.00
  Metric: 10         IP 2.2.2.2/32
  Metric: 20         IP 3.3.3.3/32
  Metric: 10         IP 10.0.12.0/24
  Metric: 10         IP 10.0.23.0/24
  Metric: 10         IP 200.200.200.200/32
  Metric: 10         MT (IPv6 Unicast) IS-Extended XR1.00
  Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:2:2::2/128
  Metric: 10         MT (IPv6 Unicast) IPv6 2001:db8:12::/64

The Accumulated IGP Metric Attribute for BGP

This is an interesting draft which can ensure better paths are chosen in certain corner cases. Before this draft, BGP was able to redistribute the IGP metric as a MED value into BGP. The issue with MED is that it’s very low in the BGP best path algorithm. Note that Cisco/Brocade consider weight as primary, but I’ll ignore that for now:

  1. Highest Local-Preference
  2. Shortest AS-Path
  3. Lowest Origin Code
  4. Lowest MED
  5. ETC

MED is only number 4 in the pecking order. In a large network it might be difficult to get everything to match up to that point. Accumulated IGP Metric (AIGP) is a new optional, non-transitive BGP path attribute that carries the accumulated IGP metric along with the route. Not only that, but the best-path algorithm is changed as follows:

  1. Highest Local-Preference
  2. Lowest AIGP Cost
  3. Shortest AS-Path
  4. Lowest Origin Code
  5. Lowest MED
  6. ETC

As long as your local-preference values match, the lowest AIGP cost is taken into account.
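
The modified ordering can be sketched as a sort key. This is a simplified model of the steps listed above and nothing more: it stops at MED, the class and function names are invented, the next-hop addresses and metric values are only illustrative, and ranking a path without an AIGP value below any path that has one is an assumption that matches the behavior observed later in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Path:
    next_hop: str
    local_pref: int = 100
    aigp: Optional[int] = None  # None means the attribute is absent
    as_path_len: int = 1
    origin: int = 2             # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0


def sort_key(path, use_aigp=True):
    """Lower tuples win: local-pref, AIGP, AS-path length, origin, MED."""
    if use_aigp:
        # Assumption: a missing AIGP value sorts after any real value.
        aigp_part = (path.aigp is None, path.aigp or 0)
    else:
        aigp_part = (False, 0)
    return (-path.local_pref, aigp_part, path.as_path_len, path.origin, path.med)


def best_path(paths, use_aigp=True):
    """Pick the preferred path; ties fall to the first path in the list."""
    return min(paths, key=lambda p: sort_key(p, use_aigp))


with_aigp = Path("10.0.12.1", aigp=120)
without_aigp = Path("10.0.21.1")
lower_aigp = Path("10.0.52.5", aigp=40)

print(best_path([with_aigp, without_aigp]).next_hop)  # 10.0.12.1
print(best_path([with_aigp, lower_aigp]).next_hop)    # 10.0.52.5
```

Because local-preference is compared first, a higher local-preference still beats any AIGP value.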

No AIGP

Take the following topology into consideration:
[Figure: BGP AIGP topology]
Assuming all link costs are the same, the shortest path for XR2 to get to IOS2 is via path XR1-XR4-IOS2. I’m going to ignore MED on XR2 for now.

Quick relevant configs on XR1 (first) and IOS1 (second):

interface GigabitEthernet0/0/0/1
 description Link to XR2
 ipv4 address 10.0.12.1 255.255.255.0
!
interface GigabitEthernet0/0/0/2
 description Link to XR3
 ipv4 address 10.0.13.1 255.255.255.0
!
prefix-set 20.20.20.20
  20.20.20.20/32
end-set
!
route-policy PASS
  pass
end-policy
!
route-policy IOS2_LOOPBACK
  if destination in 20.20.20.20 then
    done
  else
    drop
  endif
end-policy
!
router isis 1
 is-type level-2-only
 net 49.0001.0000.0000.0001.00
 address-family ipv4 unicast
  metric-style wide
 !
 interface Loopback0
  address-family ipv4 unicast
  !
 !
 interface GigabitEthernet0/0/0/2
  address-family ipv4 unicast
  !
 !
!
router bgp 64512
 address-family ipv4 unicast
  redistribute isis 1 route-policy IOS2_LOOPBACK
 !
 neighbor 10.0.12.2
  remote-as 64513
  address-family ipv4 unicast
   route-policy PASS in
   route-policy PASS out
  !
 !
!

interface Loopback0
 ip address 10.10.10.10 255.255.255.255
 ip router isis 1
!
interface GigabitEthernet0/0
!
interface GigabitEthernet0/0.41
 encapsulation dot1Q 41
 ip address 10.0.41.1 255.255.255.0
 ip router isis 1
!
router isis 1
 net 49.0001.0000.0000.0010.00
 is-type level-2-only
 metric-style wide
!
router bgp 64512
 bgp log-neighbor-changes
 redistribute isis 1 level-2 route-map IOS2_LOOPBACK
 neighbor 10.0.21.2 remote-as 64513
!
ip prefix-list 20.20.20.20 seq 5 permit 20.20.20.20/32
!
route-map IOS2_LOOPBACK permit 10
 match ip address prefix-list 20.20.20.20
!
route-map IOS2_LOOPBACK deny 20

XR2 should now have the 20.20.20.20/32 prefix twice. Let’s check the route that XR2 chose:

RP/0/0/CPU0:XR2#show bgp ipv4 un 20.20.20.20
Mon Aug 11 18:29:47.825 UTC
BGP routing table entry for 20.20.20.20/32
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  5           5
Last Modified: Aug 11 18:24:57.101 for 00:04:50
Paths: (2 available, best #1)
  Advertised to update-groups (with more than one peer):
    0.1
  Path #1: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.1
  64512
    10.0.12.1 from 10.0.12.1 (1.1.1.1)
      Origin incomplete, metric 0, localpref 100, valid, external, best, group-best, import-candidate, import suspect
      Received Path ID 0, Local Path ID 1, version 5
      Origin-AS validity: not-found
  Path #2: Received by speaker 0
  Not advertised to any peer
  64512
    10.0.21.1 from 10.0.21.1 (10.10.10.10)
      Origin incomplete, metric 0, localpref 100, valid, external, import-candidate, import suspect
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found

Currently it’s going the correct way, however what happens if the metric of XR1’s route to 20.20.20.20/32 is increased?

router isis 1
!
 interface GigabitEthernet0/0/0/2
  address-family ipv4 unicast
   metric 100

XR2 still sees the best route via XR1:

RP/0/0/CPU0:XR2#show route ipv4 20.20.20.20
Mon Aug 11 18:31:34.958 UTC

Routing entry for 20.20.20.20/32
  Known via "bgp 64513", distance 20, metric 0
  Tag 64512, type external
  Installed Aug 11 18:24:57.065 for 00:06:37
  Routing Descriptor Blocks
    10.0.12.1, from 10.0.12.1, BGP external
      Route metric is 0
  No advertising protos.

AIGP

In order to send AIGP, you need to ensure that the AIGP metric is being set in your route-policy, as well as turn on the feature under the neighbour address family. I’ll be doing this on XR1:

route-policy IOS2_LOOPBACK
  if destination in 20.20.20.20 then
    set aigp-metric igp-cost
  else
    drop
  endif
end-policy
!
router bgp 64512
 !
 neighbor 10.0.12.2
  address-family ipv4 unicast
   aigp

AIGP has just been added to legacy IOS on version 15.4(3)T which is a version I don’t have in my lab yet. Let’s take a look at the consequences of one router setting this value and the other not.

RP/0/0/CPU0:XR2#show route ipv4 20.20.20.20
Mon Aug 11 18:42:42.952 UTC

Routing entry for 20.20.20.20/32
  Known via "bgp 64513", distance 20, metric 120 (AIGP metric)
  Tag 64512, type external
  Installed Aug 11 18:35:09.393 for 00:07:33
  Routing Descriptor Blocks
    10.0.12.1, from 10.0.12.1, BGP external
      Route metric is 120

IOS-XR is preferring the route with the AIGP metric set. You can see the metric value of 120 has been learned. It also sets the local route metric to 120. The update from IOS1 is not preferred, so it seems that a route without an AIGP value is seen as worse than any AIGP value that may be set.

I’m going to swap out IOS1 with another IOS-XR box. This new XR box will be advertising the route with the same metric as IOS1 currently is.
[Figure: BGP AIGP topology with XR5 in place of IOS1]

XR2 should now be seeing both AIGP values and choosing XR5 as the next-hop:

RP/0/0/CPU0:XR2#show bgp ipv4 unicast 20.20.20.20/32
Mon Aug 11 19:33:43.432 UTC
BGP routing table entry for 20.20.20.20/32
Versions:
  Process           bRIB/RIB  SendTblVer
  Speaker                  9           9
Last Modified: Aug 11 19:33:33.101 for 00:00:10
Paths: (2 available, best #2)
  Advertised to update-groups (with more than one peer):
    0.1
  Path #1: Received by speaker 0
  Not advertised to any peer
  64512
    10.0.12.1 from 10.0.12.1 (1.1.1.1)
      Origin incomplete, metric 0, localpref 100, aigp metric 120, valid, external, import suspect
      Received Path ID 0, Local Path ID 0, version 0
      Origin-AS validity: not-found
      Total AIGP metric 120
  Path #2: Received by speaker 0
  Advertised to update-groups (with more than one peer):
    0.1
  64512
    10.0.52.5 from 10.0.52.5 (5.5.5.5)
      Origin incomplete, metric 0, localpref 100, aigp metric 40, valid, external, best, group-best, import-candidate, import suspect
      Received Path ID 0, Local Path ID 1, version 9
      Origin-AS validity: not-found
      Total AIGP metric 40

Once again, the local route metric has been set to match the AIGP metric:

RP/0/0/CPU0:XR2#sh ip route 20.20.20.20
Mon Aug 11 19:34:54.567 UTC

Routing entry for 20.20.20.20/32
  Known via "bgp 64513", distance 20, metric 40 (AIGP metric)
  Tag 64512, type external
  Installed Aug 11 19:33:33.523 for 00:01:21
  Routing Descriptor Blocks
    10.0.52.5, from 10.0.52.5, BGP external
      Route metric is 40
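
The two results above can be summarised as a small Python sketch of the AIGP comparison step. It is only a model of the behaviour observed in this lab, which also matches my reading of RFC 7311: if any path carries AIGP, paths without it drop out, and the lowest value wins. The `Path` class and `aigp_step` function are hypothetical names, the neighbour labels are just strings, and the sketch ignores the IGP cost to the next hop that a real implementation adds to get the total AIGP metric.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Path:
    neighbor: str          # label only, e.g. "XR1" (hypothetical field)
    aigp: Optional[int]    # None when the update carries no AIGP attribute


def aigp_step(paths: List[Path]) -> List[Path]:
    """Return the paths that survive the AIGP comparison."""
    with_aigp = [p for p in paths if p.aigp is not None]
    if not with_aigp:
        # Nobody sent AIGP, so this step does not eliminate anything.
        return list(paths)
    lowest = min(p.aigp for p in with_aigp)
    return [p for p in with_aigp if p.aigp == lowest]


# First test in the post: XR1 sends AIGP 120, IOS1 sends no AIGP at all.
first = aigp_step([Path("XR1", 120), Path("IOS1", None)])

# Second test: IOS1 swapped for XR5, which sends AIGP 40.
second = aigp_step([Path("XR1", 120), Path("XR5", 40)])

print([p.neighbor for p in first], [p.neighbor for p in second])
```

Any paths still tied after this step would carry on through the rest of the normal BGP best path selection.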

OSPF Enhancements in recent IOS versions

OSPFv3 Authentication Trailer

In 2011 I wrote an article showing that in order to provide authenticated OSPFv3 neighbour sessions, you needed the security license on IOS.

Manav Bhatia commented on that post stating they were working on an IETF standard to fix this. That draft became RFC6506 and then RFC7166.

Cisco has added support for RFC7166 as of IOS 15.4(2)T and IOS-XE 3.11S.

Configuration is very quick and easy. Note that the OSPFv3 authentication trailer does not support MD5 according to the RFC. If you configure your key chain with MD5, it will not work.
[Image: OSPFv3 authentication topology]

R1#sh run int gi0/0.12
Building configuration...

Current configuration : 125 bytes
!
interface GigabitEthernet0/0.12
 encapsulation dot1Q 12
 ipv6 address 2001:DB8:12:0:10:1:2:1/64
 ospfv3 1 ipv6 area 0
end

Standard interface config. I’ll now configure the key chain and ensure all area 0 adjacencies are authenticated:

R1#sh run | sec key chain
key chain AUTH
 key 1
  key-string RFC
  cryptographic-algorithm hmac-sha-512

R1#sh run | sec router ospfv3
router ospfv3 1
 router-id 1.1.1.1
 !
 address-family ipv6 unicast
  authentication mode strict
  area 0 authentication key-chain AUTH
 exit-address-family
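
Since an MD5 key chain will not work here, it can be worth checking a key chain before applying it. The following is a rough Python sketch that scans key chain config text and flags any key whose algorithm is outside the HMAC-SHA set that I understand RFC7166 to define. The function name `unsupported_keys` and the `BAD_CHAIN` sample are made up for illustration, and the parsing only handles the simple layout shown above.

```python
import re

# Algorithms I understand RFC7166 to define for the authentication trailer.
SUPPORTED = {"hmac-sha-1", "hmac-sha-256", "hmac-sha-384", "hmac-sha-512"}

GOOD_CHAIN = """key chain AUTH
 key 1
  key-string RFC
  cryptographic-algorithm hmac-sha-512
"""

# Hypothetical key chain, not taken from the lab.
BAD_CHAIN = """key chain AUTH
 key 1
  key-string RFC
  cryptographic-algorithm md5
"""


def unsupported_keys(config: str):
    """Return (key_id, algorithm) pairs whose algorithm is not in SUPPORTED."""
    bad = []
    key_id = None
    for raw in config.splitlines():
        line = raw.strip()
        m = re.match(r"key (\d+)$", line)
        if m:
            key_id = int(m.group(1))  # remember which key we are inside
            continue
        m = re.match(r"cryptographic-algorithm (\S+)$", line)
        if m and m.group(1).lower() not in SUPPORTED:
            bad.append((key_id, m.group(1)))
    return bad


print(unsupported_keys(GOOD_CHAIN))
print(unsupported_keys(BAD_CHAIN))
```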

Verify:

R1#show ospfv3 interface
GigabitEthernet0/0.12 is up, line protocol is up
  Link Local Address FE80::A8AA:11FF:FE11:1111, Interface ID 15
  Area 0, Process ID 1, Instance ID 0, Router ID 1.1.1.1
  Network Type BROADCAST, Cost: 1
  Cryptographic authentication enabled with strict key lifetime
    Sending SA: Key 1, Algorithm HMAC-SHA-512 - key chain AUTH
  Transmit Delay is 1 sec, State BDR, Priority 1
  Designated Router (ID) 2.2.2.2, local address FE80::A8AA:22FF:FE22:2222
  Backup Designated router (ID) 1.1.1.1, local address FE80::A8AA:11FF:FE11:1111
  Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
    Hello due in 00:00:06
  Graceful restart helper support enabled
  Index 1/1/1, flood queue length 0
  Next 0x0(0)/0x0(0)/0x0(0)
  Last flood scan length is 2, maximum is 2
  Last flood scan time is 0 msec, maximum is 0 msec
  Neighbor Count is 1, Adjacent neighbor count is 1
    Adjacent with neighbor 2.2.2.2  (Designated Router)
  Suppress hello for 0 neighbor(s)


R1#show ospfv3 neighbor

          OSPFv3 1 address-family ipv6 (router-id 1.1.1.1)

Neighbor ID     Pri   State           Dead Time   Interface ID    Interface
2.2.2.2           1   FULL/DR         00:00:33    15              GigabitEthernet0/0.12

Oddly, IOS-XR 5.2.0 still does not support this RFC. It only offers the older IPsec authentication:

RP/0/0/CPU0:XR6(config-ospfv3-ar)#authentication ?
  disable  Do not authenticate OSPFv3 packets
  ipsec    Use IPSec AH authentication

OSPFv2 Multiarea Adjacency

In 2012 I wrote another post explaining the problem of suboptimal routing in OSPFv2. RFC5185 was created to allow a single interface to be in multiple areas. At the time of writing that original post, this feature was only in Junos and IOS-XE. It has now been added to IOS 15.4(1)T. I recommend you read the above post first to understand the issue.

I’ll use a similar topology to that original post. I’ve substituted R5 and R6 with IOS-XR boxes:
[Image: OSPF multi-area adjacency topology]

XR5 goes over the primary link, but XR6 goes over the backup:

RP/0/0/CPU0:XR5#traceroute 4.4.4.4
Thu Aug  7 21:39:21.188 UTC

Type escape sequence to abort.
Tracing the route to 4.4.4.4

 1  10.0.15.1 9 msec  0 msec  0 msec
 2  10.0.14.4 0 msec  *  0 msec


RP/0/0/CPU0:XR6#traceroute 4.4.4.4
Thu Aug  7 21:39:31.999 UTC

Type escape sequence to abort.
Tracing the route to 4.4.4.4

 1  10.0.26.2 0 msec  0 msec  0 msec
 2  10.0.24.4 0 msec  *  0 msec

Configuration is pretty simple. Add the original area plus the second area to the interface needed:

interface GigabitEthernet0/0.13
 encapsulation dot1Q 13
 ip address 10.0.13.1 255.255.255.0
 ip ospf multi-area 4
 ip ospf 1 area 0

To verify:

R1#show ip ospf 1 multi-area
OSPF_MA0 is down, line protocol is down
  Primary Interface GigabitEthernet0/0.13, Area 4
  Interface ID 17
  MTU is 1500 bytes
  Interface DOWN as link is not P2P

An interesting caveat: the interface needs to be in point-to-point mode for this to work:

R1(config)#int gi0/0.13
R1(config-subif)#ip ospf net point-to-point

Once I’ve made the above changes on R1, R2, and R3:

R1#show ip ospf 1 multi-area
OSPF_MA0 is up, line protocol is up
  Primary Interface GigabitEthernet0/0.13, Area 4
  Interface ID 17
  MTU is 1500 bytes
  Neighbor Count is 1
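
If you are checking this on more than a couple of routers, the two outputs above are easy to parse. This is a minimal Python sketch, assuming the output looks exactly like the samples from R1. `multi_area_status` and its dictionary keys are hypothetical names of my own.

```python
import re

DOWN_OUTPUT = """OSPF_MA0 is down, line protocol is down
  Primary Interface GigabitEthernet0/0.13, Area 4
  Interface ID 17
  MTU is 1500 bytes
  Interface DOWN as link is not P2P
"""

UP_OUTPUT = """OSPF_MA0 is up, line protocol is up
  Primary Interface GigabitEthernet0/0.13, Area 4
  Interface ID 17
  MTU is 1500 bytes
  Neighbor Count is 1
"""


def multi_area_status(output: str) -> dict:
    """Pull state, area, down reason and neighbour count out of the text."""
    status = {}
    m = re.search(r"^(OSPF_MA\d+) is (\w+), line protocol is (\w+)", output, re.M)
    if m:
        status["interface"], status["state"], status["protocol"] = m.groups()
    m = re.search(r"Primary Interface (\S+), Area (\S+)", output)
    if m:
        status["primary"], status["area"] = m.groups()
    m = re.search(r"Interface DOWN as (.+)", output)
    status["down_reason"] = m.group(1).strip() if m else None
    m = re.search(r"Neighbor Count is (\d+)", output)
    status["neighbors"] = int(m.group(1)) if m else 0
    return status


before = multi_area_status(DOWN_OUTPUT)
after = multi_area_status(UP_OUTPUT)
print(before["state"], before["down_reason"])
print(after["state"], after["neighbors"])
```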

A traceroute from XR6 should now follow the path over the primary link:

RP/0/0/CPU0:XR6#traceroute 4.4.4.4
Thu Aug 7 21:55:58.302 UTC

Type escape sequence to abort.
Tracing the route to 4.4.4.4

 1  10.0.26.2 0 msec  0 msec  0 msec
 2  10.0.23.3 0 msec  0 msec  0 msec
 3  10.0.13.1 0 msec  0 msec  0 msec
 4  10.0.14.4 0 msec  *  0 msec
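
To compare the path before and after the change without eyeballing it, the hop addresses can be pulled out of the traceroute text. The Python sketch below assumes the IOS-XR traceroute layout shown in this post, and `hops` is a hypothetical helper name.

```python
import re

BEFORE = """Type escape sequence to abort.
Tracing the route to 4.4.4.4

 1  10.0.26.2 0 msec  0 msec  0 msec
 2  10.0.24.4 0 msec  *  0 msec
"""

AFTER = """Type escape sequence to abort.
Tracing the route to 4.4.4.4

 1  10.0.26.2 0 msec  0 msec  0 msec
 2  10.0.23.3 0 msec  0 msec  0 msec
 3  10.0.13.1 0 msec  0 msec  0 msec
 4  10.0.14.4 0 msec  *  0 msec
"""


def hops(traceroute_output: str):
    """Return the hop addresses in order, one per numbered line."""
    found = []
    for line in traceroute_output.splitlines():
        # A hop line starts with the hop number followed by an IPv4 address.
        m = re.match(r"\s*\d+\s+(\d+\.\d+\.\d+\.\d+)\s", line)
        if m:
            found.append(m.group(1))
    return found


path_before = hops(BEFORE)
path_after = hops(AFTER)
print(path_before)
print(path_after)
```

The last hop changing from 10.0.24.4 to 10.0.14.4 is the quick tell that XR6 now reaches 4.4.4.4 over the same link XR5 uses.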

OSPF Multi-Area Adjacency is one of those things that can fix some odd corner case topologies. I would not recommend it. The issue is that now R3 has a full area 4 and area 0 database. It’s also messy. Rather redesign your network!

IOS-XR has had this feature since v3.4.1. A quick config on XR6:

RP/0/0/CPU0:XR6#sh run router ospf
Thu Aug 7 22:05:25.833 UTC
router ospf 1
 area 0
  interface GigabitEthernet0/0/0/0.26
   network point-to-point
  !
 !
 area 10
  multi-area-interface GigabitEthernet0/0/0/0.26
  !
 !
!

To verify on XR you need to look at the last few lines on a show ospf interface:

RP/0/0/CPU0:XR6#show ospf interface gi0/0/0/0.26
Thu Aug 7 22:05:53.841 UTC

GigabitEthernet0/0/0/0.26 is up, line protocol is up
  Internet Address 10.0.26.6/24, Area 0
  Process ID 1, Router ID 10.0.26.6, Network Type POINT_TO_POINT, Cost: 1
  Transmit Delay is 1 sec, State POINT_TO_POINT, MTU 1500, MaxPktSz 1500
  Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
    Hello due in 00:00:07:158
  Index 1/1, flood queue length 0
  Next 0(0)/0(0)
  Last flood scan length is 1, maximum is 1
  Last flood scan time is 0 msec, maximum is 0 msec
  LS Ack List: current length 0, high water mark 16
  Neighbor Count is 1, Adjacent neighbor count is 1
    Adjacent with neighbor 2.2.2.2
  Suppress hello for 0 neighbor(s)
  Multi-area interface Count is 1
    Multi-Area interface exist in area 10 Neighbor Count is 1

OSPFv3 Multiarea Adjacency

IOS 15.4(2)T and IOS-XE 3.11S now have support for multi-area adjacency for OSPFv3.

The config is identical to OSPFv2 so I’m not going to go over it here.

That’s a wrap for today.