SPF Delay – CCDE

SPF timers are usually one of those things that engineers don’t bother with. Hello/Dead timers are often adjusted, but not actual SPF timers themselves.

Different vendors, and even different platforms within vendors, can have dramatically different timers. Micro-loops can be even more pronounced when different vendors/platforms are involved.

SPF Timers

In OSPF, SPF is only run when certain conditions are met. One of those conditions is when a router originates a new type-1 LSA. If a router interface goes down, it will originate a new type-1 to let other routers in the area know about it. How soon after the interface goes down does the type-1 get sent? Once another router in the area receives that type-1, does it run SPF straight away? Does it flood the LSA before or after it runs SPF?
Micro-loops form when routers’ FIBs do not agree on where the best path is. Two routers will bounce a packet backwards and forwards between each other until they agree on the forwarding path and have that path installed in their FIBs.

The best way to understand this is to show the loop forming.

Let’s consider the following topology of five routers. The OSPF cost of each link is also displayed:
[Figure: SPF Timers]

Most router interfaces have a cost of 50, while R3 has a second slower link with a cost of 200.

Under normal circumstances, any traffic from R1 to R5 will go through R2-R4.
[Figure: SPF Timers2]

R1#traceroute 10.0.0.5
Type escape sequence to abort.
Tracing the route to 10.0.0.5
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 12 msec 32 msec 16 msec
  2 192.168.24.4 44 msec 56 msec 16 msec
  3 192.168.45.5 68 msec 48 msec 48 msec
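
The path selection above can be reproduced with a small Dijkstra sketch in Python. The topology diagram is not reproduced in this text, so the link list is an assumption: in particular, I am assuming the slow cost-200 link is the one between R3 and R4. The names `LINKS`, `build_graph` and `shortest_path` are made up for illustration.

```python
import heapq

# Assumed topology (the diagram is not shown here). The R3-R4 link is
# assumed to be the slow cost-200 link; every other link costs 50.
LINKS = {
    ("R1", "R2"): 50,
    ("R2", "R3"): 50,
    ("R2", "R4"): 50,
    ("R3", "R4"): 200,
    ("R4", "R5"): 50,
}

def build_graph(links):
    """Turn the link list into a symmetric adjacency map."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def shortest_path(graph, src, dst):
    """Dijkstra: return (total_cost, [nodes]) for the best path, or None."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, link_cost in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None

graph = build_graph(LINKS)
print(shortest_path(graph, "R1", "R5"))  # (150, ['R1', 'R2', 'R4', 'R5'])
print(shortest_path(graph, "R3", "R5"))  # (150, ['R3', 'R2', 'R4', 'R5'])
```

Under these assumed costs R3 also prefers to reach R5 through R2, because 50+50+50 beats 200+50 over its own slow link.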

When the link between R2 and R4 fails, traffic should traverse the R2-R3-R4 links:
[Figure: SPF Timers3]

For a number of milliseconds after the failure, this will not be the case.

In order to show how a micro-loop is formed, I’ll first need to artificially increase my SPF timers. This is because it’s very difficult to show an actual micro-loop simply with traceroute.
On R3 I’ll increase the wait time to run SPF after it receives an LSA to 10 seconds:

R3(config)#router ospf 1
R3(config-router)# timers throttle spf 10000 10000 10000
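
As I understand the IOS documentation, the three values are the initial delay before the first SPF run (spf-start), the hold time between consecutive runs (spf-hold) and the maximum wait (spf-max-wait), with the hold time doubling under sustained churn until it reaches the maximum. Here is a simplified Python sketch of that back-off. The function name `spf_wait_intervals` is hypothetical, and the model ignores the reset back to the initial delay after a quiet period.

```python
def spf_wait_intervals(start_ms, hold_ms, max_wait_ms, runs):
    """Wait applied before each of `runs` consecutive SPF runs.

    Simplified model (an assumption, not the router's actual code):
    the first run waits start_ms, later runs wait the current hold
    time, which doubles after every run and is capped at max_wait_ms.
    """
    waits = []
    hold = hold_ms
    for run in range(runs):
        if run == 0:
            waits.append(start_ms)
        else:
            waits.append(min(hold, max_wait_ms))
            hold *= 2
    return waits

# The lab setting above: every wait is pinned at 10 seconds.
print(spf_wait_intervals(10000, 10000, 10000, 3))  # [10000, 10000, 10000]
```

With all three values set to 10000 there is no back-off to speak of: R3 simply waits 10 seconds every time, which is exactly what makes the loop visible.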

I’ll now break the link between R2 and R4 and run another traceroute from R1 to R5:

R2(config)#int gi2/0
R2(config-if)#shut

R1#traceroute 10.0.0.5
Type escape sequence to abort.
Tracing the route to 10.0.0.5
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 16 msec 16 msec 12 msec
  2  *  *
    192.168.23.3 36 msec
  3 192.168.23.2 40 msec 36 msec 68 msec
  4 192.168.23.3 44 msec 60 msec 60 msec
  5 192.168.23.2 56 msec 64 msec 60 msec
  6 192.168.23.3 100 msec 80 msec 80 msec
  7 192.168.23.2 80 msec 80 msec 84 msec
  8 192.168.23.3 80 msec 104 msec 104 msec
  9 192.168.23.2 100 msec 104 msec 100 msec
 10 192.168.23.3 128 msec 124 msec 124 msec
 11 192.168.23.2 132 msec 116 msec 124 msec
 12 192.168.23.3 152 msec 148 msec 148 msec
 13 192.168.23.2 144 msec 144 msec 148 msec
 14 192.168.23.3 152 msec
    192.168.45.5 112 msec 84 msec

Because R3 is delaying its SPF run until 10 seconds after it receives a relevant LSA, it still assumes the best path is through R2. R2 has already run its SPF and assumes the best path is through R3. This is why the packet bounces between the two routers. The packet gets to its destination only once R3 has run SPF and updated CEF.
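
The bouncing can be modelled as a toy Python sketch in which each router holds a single next hop towards R5. The FIB contents are assumptions reconstructed from the traceroute (the R3 entries assume its other link goes to R4), and `forward` and `has_loop` are made-up helper names.

```python
# Next hop towards R5 in each router's FIB just after the R2-R4 failure.
# R2 has re-run SPF and points at R3; R3 is still waiting and points at R2.
STALE_FIBS = {"R1": "R2", "R2": "R3", "R3": "R2", "R4": "R5"}

# The same FIBs once R3 has run SPF (assumes R3's second link goes to R4).
CONVERGED_FIBS = {"R1": "R2", "R2": "R3", "R3": "R4", "R4": "R5"}

def forward(fibs, src, dst, ttl=16):
    """Follow next hops from src until dst is reached or the TTL runs out."""
    path = [src]
    node = src
    while node != dst and ttl > 0:
        node = fibs[node]
        path.append(node)
        ttl -= 1
    return path

def has_loop(path):
    """A router appearing twice in the path means the packet looped."""
    return len(path) != len(set(path))

print(forward(STALE_FIBS, "R1", "R5", ttl=6))
# ['R1', 'R2', 'R3', 'R2', 'R3', 'R2', 'R3']
print(forward(CONVERGED_FIBS, "R1", "R5"))
# ['R1', 'R2', 'R3', 'R4', 'R5']
```

The stale case never reaches R5 and only stops because the TTL expires, which mirrors the alternating 192.168.23.2 and 192.168.23.3 hops in the traceroute.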

Of course in the real world we don’t wait 10 seconds. But what are the actual timers? That depends a lot on which vendor and platform you’re running:
[table]
Vendor,OS,Initial SPF Delay (ms)
Cisco,IOS & IOS-XE,5000
Cisco,IOS-XR,50
Cisco,NX-OS,200
Juniper,Junos,200
[/table]
The above list is of course not exhaustive.

The timers between vendors and platforms can be dramatically different. Even in an environment where you do not care about rapid convergence, it’s still important that your IGP routers all agree on their timers. Connecting an ASR1k to an ASR9k and leaving both at their default timers could cause traffic to loop for almost five seconds. I would suggest you ensure all OSPF routers in an area, or all IS-IS routers in the same level, have identical timers.
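
The "almost five seconds" figure is just the difference between the two initial delays. A quick Python sketch using the values from the table above follows. The ASR1k runs IOS-XE and the ASR9k runs IOS-XR. The helper `delay_mismatch_ms` is a made-up name, and the calculation ignores flooding, SPF run time and FIB update time.

```python
# Initial SPF delays in milliseconds, copied from the table above.
INITIAL_SPF_DELAY_MS = {
    "Cisco IOS / IOS-XE": 5000,
    "Cisco IOS-XR": 50,
    "Cisco NX-OS": 200,
    "Juniper Junos": 200,
}

def delay_mismatch_ms(platform_a, platform_b):
    """Rough upper bound on how long two neighbours can disagree,
    counting only the difference in initial SPF delay."""
    return abs(INITIAL_SPF_DELAY_MS[platform_a] - INITIAL_SPF_DELAY_MS[platform_b])

# ASR1k (IOS-XE) next to ASR9k (IOS-XR)
print(delay_mismatch_ms("Cisco IOS / IOS-XE", "Cisco IOS-XR"))  # 4950
```

Two platforms with the same initial delay, such as NX-OS and Junos in the table, give a mismatch of zero under this simplification.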

Another option is to set the initial SPF delay high enough that the LSA/LSP reaches all edges of the area/level before any router runs SPF. That way all routers can run SPF at the same time and update their FIBs at the same time. The problem with this approach is that each router receives the LSA at a different time. Even if they all received it at exactly the same time, we would be relying on every router having 100% identical SPF and FIB-update run times.

Further Reading

RFC 5715 – A Framework for Loop-Free Convergence
RFC 6976 – Framework for Loop-Free Convergence Using the Ordered Forwarding Information Base (oFIB) Approach

© 2009-2019 Darren O'Connor All Rights Reserved