Category Archives: Design

SPF Delay – CCDE

SPF timers are usually one of those things that engineers don’t bother with. Hello/Dead timers are often adjusted, but not actual SPF timers themselves.

Different vendors, and even different platforms within vendors, can have dramatically different timers. Micro-loops can be even more pronounced when different vendors/platforms are involved.

SPF Timers

In OSPF, SPF is only run when certain conditions are met. One of those conditions is when a router originates a new type-1 LSA. If a router interface goes down, it will originate a new type-1 to let other routers in the area know about it. How soon after the interface goes down does the type-1 get sent? Once another router in the area receives that type-1, does it run SPF straight away? Does it flood the LSA before or after it runs SPF?
Micro-loops form when routers’ FIBs do not agree on where the best path is. Two routers will bounce a packet backwards and forwards to each other until those routers agree on the forwarding path and have that path installed in their FIB.

The best way to understand this is to show the loop forming.

Let’s consider the following topology of five routers. The OSPF cost of each link is also displayed:
[Figure: five-router topology with OSPF link costs]

Most router interfaces have a cost of 50, while R3 has a second slower link with a cost of 200.

Under normal circumstances, any traffic from R1 to R5 will go through R2-R4.
[Figure: normal traffic path from R1 to R5 via R2 and R4]

R1#traceroute 10.0.0.5
Type escape sequence to abort.
Tracing the route to 10.0.0.5
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 12 msec 32 msec 16 msec
  2 192.168.24.4 44 msec 56 msec 16 msec
  3 192.168.45.5 68 msec 48 msec 48 msec

When the link between R2 and R4 fails, traffic should traverse the R2-R3-R4 links:
[Figure: expected path over the R2-R3-R4 links after the R2-R4 link fails]
There are a number of milliseconds where this will not be the case.

In order to show how a micro-loop is formed, I’ll first need to artificially increase my SPF timers. This is because it’s very difficult to show an actual micro-loop simply with traceroute.
On R3 I’ll increase the wait time to run SPF after it receives an LSA to 10 seconds:

R3(config)#router ospf 1
R3(config-router)# timers throttle spf 10000 10000 10000

I’ll now break the link between R2 and R4 and run another traceroute from R1 to R5:

R2(config)#int gi2/0
R2(config-if)#shut
R1#traceroute 10.0.0.5
Type escape sequence to abort.
Tracing the route to 10.0.0.5
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 16 msec 16 msec 12 msec
  2  *  *
    192.168.23.3 36 msec
  3 192.168.23.2 40 msec 36 msec 68 msec
  4 192.168.23.3 44 msec 60 msec 60 msec
  5 192.168.23.2 56 msec 64 msec 60 msec
  6 192.168.23.3 100 msec 80 msec 80 msec
  7 192.168.23.2 80 msec 80 msec 84 msec
  8 192.168.23.3 80 msec 104 msec 104 msec
  9 192.168.23.2 100 msec 104 msec 100 msec
 10 192.168.23.3 128 msec 124 msec 124 msec
 11 192.168.23.2 132 msec 116 msec 124 msec
 12 192.168.23.3 152 msec 148 msec 148 msec
 13 192.168.23.2 144 msec 144 msec 148 msec
 14 192.168.23.3 152 msec
    192.168.45.5 112 msec 84 msec

Because R3 is delaying its SPF run until 10 seconds after it receives a relevant LSA, it still assumes the best path is through R2. R2 has run its SPF and it assumes the best path is through R3. This is the reason the packet bounces between both routers. The packet gets to its destination only when R3 has run SPF and CEF has been updated.
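The bouncing can be reproduced with a small sketch. This is only an illustration, not how a router forwards: the next-hop tables below are hypothetical and hand-written to mirror the state described above (R2 has already converged, R3 has not), and the hop limit of 14 is an arbitrary choice.

```python
def trace(fibs, src, dst, max_hops=14):
    """Follow per-router next hops from src towards dst, giving up after max_hops."""
    path = [src]
    node = src
    while node != dst and len(path) <= max_hops:
        node = fibs[node]
        path.append(node)
    return path

# Hypothetical next hops towards R5 just after the R2-R4 link fails.
# R2 has run SPF (next hop R3); R3 has not, so it still points back at R2.
stale = {"R1": "R2", "R2": "R3", "R3": "R2", "R4": "R5"}

# Once R3 has run SPF and updated its FIB, it forwards towards R4.
converged = dict(stale, R3="R4")

print(trace(stale, "R1", "R5"))      # bounces between R2 and R3 until the hop limit
print(trace(converged, "R1", "R5"))  # ['R1', 'R2', 'R3', 'R4', 'R5']
```

The loop only ends when both tables agree, which is exactly what the traceroute shows at hop 14.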

Of course in the real world we don’t wait 10 seconds. But what are the actual timers? That depends a lot on which vendor and platform you’re running:

Vendor     OS              Initial SPF Delay (ms)
Cisco      IOS & IOS-XE    5000
Cisco      IOS-XR          50
Cisco      NX-OS           200
Juniper    Junos           200

The above list is of course not exhaustive.

The timers between vendors and platforms can be dramatically different. Even in an environment where you do not care about rapid convergence, it’s still important that your IGP routers all agree on their timers. Connecting an ASR1k to an ASR9k with default timers could cause traffic to loop for almost five seconds. I would suggest you ensure all OSPF routers in an area, or all IS-IS routers in the same level, have identical timers.
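As a rough illustration of where "almost five seconds" comes from, the sketch below subtracts the initial SPF delays from the table above. It assumes the loop window is dominated by the difference in initial delay and ignores flooding time, SPF run time and FIB update time, so treat the result as an estimate only. The function name is made up for this example.

```python
# Initial SPF delays in milliseconds, copied from the table above.
initial_spf_delay_ms = {
    "Cisco IOS/IOS-XE": 5000,
    "Cisco IOS-XR": 50,
    "Cisco NX-OS": 200,
    "Juniper Junos": 200,
}

def loop_window_ms(a, b):
    """Rough estimate of how long two platforms can disagree after a failure."""
    return abs(initial_spf_delay_ms[a] - initial_spf_delay_ms[b])

# ASR1k runs IOS-XE, ASR9k runs IOS-XR.
print(loop_window_ms("Cisco IOS/IOS-XE", "Cisco IOS-XR"))  # 4950
```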

Another option is to ensure the initial SPF delay timer is set high enough so that the LSA/LSP reaches all edges of the area/level. That way all routers can run SPF at the same time and update their FIBs at the same time. The problem with this approach is that each router receives the LSA at a different time. Even if they did receive it at exactly the same time, we would be relying on all routers having 100% identical SPF and FIB-update run times.

Further Reading

RFC 5715 – A Framework for Loop-Free Convergence
RFC 6976 – Framework for Loop-Free Convergence Using the Ordered Forwarding Information Base (oFIB) Approach

When a vlan is not a vlan

What is a vlan? What is a vlan-id? Are they the same thing?

Generally yes, but in the ISP world a vlan-id can also be a circuit identifier. While your view of a vlan might be a single broadcast domain, you’ll soon see that multiple vlan IDs can share the same single broadcast domain, or the same vlan-id could be in a completely different broadcast domain.

The Problem

I’ve written about this before. Carriers, at least in the UK, are offering more and more aggregated links to Service Providers. Each circuit to customer sites is aggregated over a single high-bandwidth link to your PE router. This cuts down on ports, cables, and man hours to plug them in.

Old way:

[Figure: old carrier topology]

New way:

[Figure: new carrier topology]
How are the p2p circuits aggregated over the core high-bandwidth link? Each p2p link is separated by a vlan tag on the PoP side. So we could say that any packet coming out of the core PE with vlan 2000 goes to site 1, while packets with vlan 3000 go to site 2. What happens if site 1 and site 2 are going to the same customer? What if you are providing a VPLS service to them? It’s essential to note that the vlan tag imposed by the carrier is used simply to determine what packet goes to which circuit. As we control the MPLS core, it’s ultimately up to us to decide which packet belongs in which broadcast domain, and that is regardless of the vlan id used by the carrier.

Relevant Initial Core Config

I’ll use the following topology:
[Figure: core topology]

R1, R2, and R3 are the core of the network. R1 is a Brocade NetIron running MPLS. R2 is a Cisco me3600x running MPLS. R3 is an me3600x running bridge-groups with no MPLS.

CE1, CE2, and CE3 are all customer routers.

R1 – Brocade XMR

interface ethernet 2/4
 port-name TO-R2
 enable
 route-only
 ip ospf area 0
 ip ospf network point-to-point
 ip address 10.10.10.10/24
!
router mpls
 policy
  traffic-eng ospf area 0

  mpls-interface e2/4

 lsp R1-R2
  to 192.168.224.4
  adaptive
  enable

R2 – Cisco me3600x running MPLS

mpls traffic-eng tunnels
!
router ospf 1
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng area 0
!
interface GigabitEthernet0/1
 description TO-R1
 no switchport
 ip address 10.10.10.11 255.255.255.0
 ip ospf network point-to-point
 ip ospf 1 area 0
 mpls traffic-eng tunnels
!
interface Tunnel0
 ip unnumbered Loopback0
 tunnel mode mpls traffic-eng
 tunnel destination 192.168.224.61
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng path-option 5 dynamic
 tunnel mpls traffic-eng record-route

There is no IP or MPLS configuration on R3 as it’s not running MPLS. I’ll show how the bridge-group is configured when I get to that part.

CPE Config

I’ll be using vlan 3000 to get to CE1, vlan 2000 to get to CE2, and double-tag vlan 3500,2500 to get to CE3. The CEs all have their WAN interfaces in the same subnet, and I’ll enable OSPF on their loopbacks and WAN links.

CE1

This is a Juniper EX3200:

root@CE1> show configuration interfaces ge-0/0/0
vlan-tagging;
unit 3000 {
    vlan-id 3000;
    family inet {
        address 1.1.1.1/24;
    }
}

root@CE1> show configuration interfaces lo0.0
family inet {
    address 10.10.10.10/32;
}

root@CE1> show configuration protocols ospf
area 0.0.0.0 {
    interface ge-0/0/0.3000;
    interface lo0.0;
}

CE2

This is a Cisco 3750G:

interface Loopback0
 ip address 20.20.20.20 255.255.255.255
 ip ospf 1 area 0
!
interface Vlan2000
 ip address 1.1.1.2 255.255.255.0
 ip ospf 1 area 0
!
interface GigabitEthernet1/0/1
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 2000
 switchport mode trunk

CE3

This is a Cisco 1841:

interface Loopback0
 ip address 30.30.30.30 255.255.255.255
 ip ospf 1 area 0
!
interface FastEthernet0/0.32
 encapsulation dot1Q 3500 second-dot1q 2500
 ip address 1.1.1.3 255.255.255.0
 ip ospf 1 area 0

VPLS Config

As you can see, each CPE will be using a different vlan tag. One site is even sending a double-tagged frame. They all need to be in the same broadcast domain. No problem as we are simply going to use the vlan tag to determine the service.

R2

Gi0/2 will create an LDP-signalled VPLS VC to R1 (aka manual set up). Interface gi0/2 vlan 2000 will be part of VPLS id 501:

ethernet evc TEST-EVC
 uni count 20
!
l2vpn vfi context TEST-VPLS
 vpn id 501
 member 192.168.224.61 encapsulation mpls
!
interface GigabitEthernet0/2
 switchport trunk allowed vlan none
 switchport mode trunk
 mtu 9800
 service instance 1 ethernet TEST-EVC
  encapsulation dot1q 2000
  rewrite ingress tag pop 1 symmetric
  bridge-domain 501
 !
interface Vlan501
 no ip address
 member vfi TEST-VPLS

What’s important to note here is that the me3600x still uses bridge-groups for VPLS, but it’s not exactly the same as just using bridge-groups by itself. You’ll see this soon enough when we configure R3.

R1

R1 will create a VPLS to R2. Vlan 3000 on interface 2/5 will be part of the same VPLS:

router mpls
 vpls TEST-VPLS 501
  vpls-peer 192.168.224.4
  vpls-mtu 1500
  vlan 3000
   tagged ethe 2/5

At this point R1 and R2 have the VPLS set up between them. Each CE is using different vlans on their WAN, but they are in fact on the same broadcast domain:

CE2#sh ip ospf neighbor

Neighbor ID     Pri   State           Dead Time   Address         Interface
1.1.1.1         128   FULL/DR         00:00:39    1.1.1.1         Vlan2000

CE2#ping 1.1.1.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/5/17 ms

CE2#ping 10.10.10.10 so lo0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.10, timeout is 2 seconds:
Packet sent with a source address of 20.20.20.20
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/9 ms

The vlan-id used on the CPE was merely used to push the frame into the correct VPLS. The VPLS itself is the broadcast domain; the vlan tag is irrelevant as it’s stripped on inbound into the PE router. You CAN, however, ensure that the PE router does NOT strip the vlan tag. This has interesting use cases when you purposely want to separate on vlan id within the VPLS. I wrote more on this over here so please give it a read. Both the Brocade and Cisco default to VC mode 5 when setting up a VPLS.

Bridge Group Config

I’m going to set up R3 so that it only uses bridge-groups. No routing or MPLS involved. Bridge-groups work very similarly to VPLS, though on a single box. Traffic can be pushed from a bridge-group into a VPLS if needed. The bridge-group determines the broadcast domain. I can have multiple different vlans in the same bridge-group.

For R3, gi0/2 is the interface pointing towards the core, while gi0/1 is pointing towards the customer. I’ll use different vlan ids on each, but they will be in the same bridge-group:

ethernet evc TEST
!
vlan 501
 name TEST-CE
!
interface GigabitEthernet0/1
 switchport trunk allowed vlan none
 switchport mode trunk
 service instance 1 ethernet TEST
  encapsulation dot1q 501
  rewrite ingress tag pop 1 symmetric
  bridge-domain 501
 !
interface GigabitEthernet0/2
 switchport trunk allowed vlan none
 switchport mode trunk
 service instance 1 ethernet TEST
  encapsulation dot1q 3500 second-dot1q 2500
  rewrite ingress tag pop 2 symmetric
  bridge-domain 501

I’m not going into detail, but I will cover the basics. When gi0/2 receives a double-tagged frame that matches 3500,2500 inbound, the me3600x will pop both tags off and the resulting frame will be part of bridge-group 501. Symmetric means that when a frame leaves gi0/2, it will re-add vlans 3500,2500 on top of the frame. As gi0/1 is also in bridge-group 501, the customer frame will be forwarded out that port, and it will have a single vlan tag of 501 pushed on top.
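The pop-on-ingress, push-on-egress behaviour can be modelled in a few lines. This is a toy model only: frames are reduced to a list of vlan tags (outermost first), the function names are invented for the example, and it assumes every service instance pops all the tags it matches, as both do in the config above.

```python
def ingress(frame_tags, match, pop):
    """Return the tags left after popping `pop` outer tags, or None if no match."""
    if frame_tags[:len(match)] != match:
        return None
    return frame_tags[pop:]

def egress(inner_tags, match):
    """Symmetric rewrite: push the service instance's tags back on the way out."""
    return match + inner_tags

GI0_2 = [3500, 2500]   # encapsulation dot1q 3500 second-dot1q 2500, pop 2
GI0_1 = [501]          # encapsulation dot1q 501, pop 1

# Frame arriving on gi0/2 and leaving on gi0/1.
in_bridge_domain = ingress([3500, 2500], GI0_2, pop=2)
print(in_bridge_domain)                  # []
print(egress(in_bridge_domain, GI0_1))   # [501]

# Return traffic: arrives on gi0/1 tagged 501, leaves gi0/2 double-tagged.
back = ingress([501], GI0_1, pop=1)
print(egress(back, GI0_2))               # [3500, 2500]
```

Inside the bridge-domain the frame carries none of these tags, which is why ports with different vlan ids can share the same broadcast domain.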

At this point gi0/1 is connected to R1 eth2/3. For this customer I would be expecting a single tag of 501 coming inbound, and so I’ll place that vlan id into the VPLS from above:

 vpls TEST-VPLS 501
  vlan 501
   tagged ethe 2/3

Now all three CE routers should be fully adjacent:

CE3#sh ip ospf neighbor

Neighbor ID     Pri   State           Dead Time   Address         Interface
1.1.1.1         128   FULL/DR         00:00:35    1.1.1.1         FastEthernet0/0.32
1.1.1.2           1   FULL/DROTHER    00:00:37    1.1.1.2         FastEthernet0/0.32

CE3#ping 10.10.10.10 so lo0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.10, timeout is 2 seconds:
Packet sent with a source address of 30.30.30.30
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms
CE3#ping 20.20.20.20 so lo0

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 20.20.20.20, timeout is 2 seconds:
Packet sent with a source address of 30.30.30.30
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/12 ms

Conclusions:

vlan tags have multiple uses. In most networks they inform the switches which vlan, and therefore which broadcast domain, a frame is part of. They can also be circuit identifiers showing which VPLS/circuit the frame belongs to. They can also be both at the same time, depending on the VPLS VC type you’re using.

The above network is extremely simplified. Care must be taken when forwarding certain layer2 control frames. Most are sent untagged out tagged interfaces. Cisco’s Rapid PVST+ and PVST+ tag each vlan BPDU with the same vlan-id. If you’re using vlan 2000 on one side and vlan 3000 on the other, and the BPDU gets through, one side will shut down their WAN link due to receiving a BPDU with a vlan tag that doesn’t match the BPDU data inside the frame.

Remote Triggered Black Hole Filtering and Flowspec

RTBH is a mature technology widely used to lower the effects of a DDOS attack against a customer of yours. While it works well, it’s a bit of a sledgehammer. Flowspec is a new technology that gives you a lot more control over what is blocked and as such it’s a lot more powerful.

I’ll be using the following diagram for this post:
[Figure: RTBH/Flowspec topology]
P1 and P2 are edge routers peering with transit peers. R3 is a route-reflector which is peered to both P1 and P2. C1 is a customer attached to P3 originating their own address space (172.16.0.0/16).

RTBH

RTBH works on the concept of black-holing traffic towards an IP host/subnet. It does this by advertising a statically injected route which has been pre-defined to have a next-hop of null0/discard.

As an example, let’s assume a host with the address 172.16.200.10 is under attack. R3, the RR is the route-injector, but it can be any of the internal iBGP routers. There is quite a bit of upfront config with RTBH, but most of this config only needs to be done once.

On all BGP routers in the core you need a route that will be discarded:

darreno@P1> show configuration routing-options
static {
    route 192.0.2.1/32 discard;
}

On all routers I want routes learned with a certain community to have their next-hop pointing to the discard route:

darreno@P1> show configuration policy-options
policy-statement BLACK-HOLE-FILTER {
    term 1 {
        from community BLACK_HOLE;
        then {
            next-hop 192.0.2.1;
        }
    }
}
community BLACK_HOLE members 65401:666;

I’m going to apply this as an inbound filter on my iBGP sessions:

darreno@P1> show configuration protocols bgp group ISP1
import BLACK-HOLE-FILTER;

Basically we are saying that any route learned via BGP with the above community should have its next-hop set to the discard route. On the route injector we set up an export policy matching static routes with a tag of 666. Any route matching will have the black hole community added. As this will be a specific route we need to ensure it doesn’t leave the confines of our AS and so we also tag no-export:

darreno@P3_RR> show configuration policy-options
policy-statement RTBH {
    term BLACK-HOLE {
        from {
            protocol static;
            tag 666;
        }
        then {
            local-preference 5000;
            community add no-export;
            community add BLACK_HOLE;
            next-hop 192.0.2.1;
            accept;
        }
    }
}
community BLACK_HOLE members 65401:666;
community no-export members no-export;

The above policy is then applied outbound on the iBGP session on the route-injector:

darreno@P3_RR> show configuration protocols bgp group ISP1
local-address 192.168.0.3;
export RTBH;

RTBH testing and verification

From a router out on the internet I can currently ping the affected host:

darreno@INTERNET> ping 172.16.200.10 interface lo0.0 rapid
PING 172.16.200.10 (172.16.200.10): 56 data bytes
!!!!!
--- 172.16.200.10 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 7.836/10.747/14.091/2.430 ms

I’ll now implement a black hole static on the route-injector:

set routing-options static route 172.16.200.10/32 next-hop 192.0.2.1 resolve tag 666

[edit]
darreno@P3_RR# commit and-quit
commit complete
Exiting configuration mode

If we ping from the internet again:

darreno@INTERNET> ping 172.16.200.10 interface lo0.0 rapid
PING 172.16.200.10 (172.16.200.10): 56 data bytes
.....
--- 172.16.200.10 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss

All packets lost. We can ensure only this /32 is affected by pinging another host in the subnet:

darreno@INTERNET> ping 172.16.200.50 interface lo0.0 rapid
PING 172.16.200.50 (172.16.200.50): 56 data bytes
!!!!!
--- 172.16.200.50 ping statistics ---
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max/stddev = 5.582/7.116/10.018/1.658 ms

Looking at the edge routers we see the learned /32, and the next-hop of discard:

darreno@P1> show route 172.16.200.10 extensive

inet.0: 17 destinations, 17 routes (17 active, 0 holddown, 0 hidden)
172.16.200.10/32 (1 entry, 1 announced)
TSI:
KRT in-kernel 172.16.200.10/32 -> {indirect(262143)}
        *BGP    Preference: 170/-5001
                Next hop type: Indirect
                Address: 0x97106d0
                Next-hop reference count: 3
                Source: 192.168.0.3
                Next hop type: Discard
                Protocol next hop: 192.0.2.1
                Indirect next hop: 94781d0 262143
                State: 
                Local AS: 65401 Peer AS: 65401
                Age: 2:15       Metric2: 0
                Task: BGP_65401.192.168.0.3+64669
                Announcement bits (2): 0-KRT 4-Resolve tree 1
                AS path: I
                Communities: 65401:666 no-export
                Accepted
                Localpref: 5000
                Router ID: 192.168.0.3
                Indirect next hops: 1
                        Protocol next hop: 192.0.2.1 Metric: 0
                        Indirect next hop: 94781d0 262143
                        Indirect path forwarding next hops: 0
                                Next hop type: Discard
                        192.0.2.1/32 Originating RIB: inet.0
                          Metric: 0                       Node path count: 1
                          Forwarding nexthops: 0
                                Next hop type: Discard

The /32 route has been learned through BGP from the route-injector. The correct communities are set. The next-hop goes to a route that is discard, and hence any packets going to this host are now discarded.

Adding and removing hosts is as simple as adding or removing routes on the route-injector.

The above works extremely well, but until the attack is finished and routes removed, that IP address is unroutable over the internet. Any traffic at all going towards it will be black-holed.
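Why only the one host is affected comes down to longest-prefix match. The sketch below is a simplified illustration: the two-entry routing table and the next-hop labels are hypothetical, standing in for what an edge router would hold once the /32 has been injected.

```python
import ipaddress

# Hypothetical routing table on an edge router: (prefix, next hop).
routes = [
    (ipaddress.ip_network("172.16.0.0/16"), "to-customer-C1"),
    (ipaddress.ip_network("172.16.200.10/32"), "discard"),  # injected RTBH route
]

def lookup(dst):
    """Longest-prefix match: the most specific covering route wins."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, nh) for net, nh in routes if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(lookup("172.16.200.10"))  # discard
print(lookup("172.16.200.50"))  # to-customer-C1
```

Every packet to 172.16.200.10 hits the discard entry regardless of protocol or port, which is the sledgehammer behaviour described above.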

Flowspec

There is a more subtle way of doing the above. RFC 5575 is the definition of a new filtering mechanism called flowspec. Oddly half the RFC authors are Cisco employees, yet as of today I can only find support for flowspec on Junos.

Essentially flowspec allows routers to advertise firewall filters to your edge BGP devices directly through BGP. Because this is a filter, it allows you to use all the actions of a regular firewall filter. Do you want to police DNS traffic only in a DNS amplification attack? Simple. Flowspec gives you the flexibility to do so.

The first part of enabling flowspec is to configure BGP to carry the NLRI. This will be done on all your internal routers:

darreno@P1> configure
Entering configuration mode

[edit]
darreno@P1# set protocols bgp group ISP1 family inet flow

[edit]
darreno@P1# commit and-quit
commit complete
Exiting configuration mode

Now let’s suppose 172.16.200.10 is under some kind of ICMP attack. I want to block all ICMP traffic to this host from the edge routers, but still allow other traffic through to the host:

root@R3_RR> show configuration routing-options flow
route BLOCK-ICMP-172.16.200.10 {
    match {
        destination 172.16.200.10/32;
        protocol icmp;
    }
    then discard;
}
term-order standard;

This router will now advertise this filter to all other iBGP peers.
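To make the difference from RTBH concrete, here is a minimal sketch of what the advertised filter matches. It is not how Junos evaluates flow routes; the dictionary layout and function name are invented for the example, and only the two match conditions from the flow route above are modelled.

```python
import ipaddress

# Mirrors the BLOCK-ICMP-172.16.200.10 flow route above.
FLOW_RULE = {
    "destination": ipaddress.ip_network("172.16.200.10/32"),
    "protocol": 1,        # IP protocol number 1 is ICMP
    "action": "discard",
}

def apply_flow_rule(dst, protocol):
    """Return the rule's action if both conditions match, otherwise 'accept'."""
    dst_matches = ipaddress.ip_address(dst) in FLOW_RULE["destination"]
    if dst_matches and protocol == FLOW_RULE["protocol"]:
        return FLOW_RULE["action"]
    return "accept"

print(apply_flow_rule("172.16.200.10", 1))  # discard (ICMP)
print(apply_flow_rule("172.16.200.10", 6))  # accept  (TCP, e.g. FTP)
print(apply_flow_rule("172.16.200.50", 1))  # accept  (different host)
```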

Flowspec testing and verification

We can test this from the internet by trying to ping to this address, and then trying to FTP. Ping should fail, while FTP should be let through:

root@INTERNET> ping 172.16.200.10 source 192.168.50.1 rapid
PING 172.16.200.10 (172.16.200.10): 56 data bytes
.....
--- 172.16.200.10 ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss
root@INTERNET> ftp 172.16.200.10 source 192.168.50.1
Connected to 172.16.200.10.
220 C1 FTP server (Version 6.00LS) ready.
Name (172.16.200.10:root): darreno
331 Password required for darreno.
Password:
230 User darreno logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp>

This works exactly as expected.

You can verify the flow NLRI coming in and applied as a filter on the edge routers:

root@P2> show route table inetflow.0

inetflow.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

172.16.200.10,*,proto=1/term:1
                   *[BGP/170] 00:12:57, localpref 100, from 192.168.0.3
                      AS path: I, validation-state: unverified
                      Fictitious

root@P2> show firewall

Filter: __default_bpdu_filter__

Filter: __flowspec_default_inet__
Counters:
Name                                                Bytes              Packets
172.16.200.10,*,proto=1                              2352                   28

172.16.200.10,* – meaning destination address 172.16.200.10/32 with any source – proto=1 is ICMP

Conclusion

  • Flowspec gives you a lot more options when it comes to filtering out DDOS attacks. Instead of isolating an IP you are able to filter specific traffic only. These firewall filters are then advertised via BGP to all your iBGP speakers.
  • As this is a firewall filter, you don’t have to specify a discard action. You can just as easily set a policing action.
  • Currently Junos supports flowspec on both the inet and inet-vpn families, so there is no v6 support yet
  • Most other vendors still don’t have working implementations

Are you sure it’s the shortest path? – OSPF Multi Area issues

So does OSPF always use the shortest path in order to ensure that packets always get from A to B with the lowest end to end cost? Not always. In fact when you have more than a single area it’s very easy to NOT go the shortest path at all. You could even turn your ‘non-transit’ 10Mb links into transit links.

Let’s take the following network as an example:
[Figure: multi-area OSPF topology]

R3 represents our core. R1 and R2 are both aggregation boxes where all our customers connect to. These boxes are connected into the core with their Gig links. R4 is our first customer. Mr customer wants a primary Gig link with a 100Mb backup link. We have decided to put each customer into their own OSPF area. We will also be changing the auto-cost reference bandwidth to 100Gb to ensure our core sees the difference between 100Mb and Gig links:

R3
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
 ip ospf 1 area 0

interface GigabitEthernet1/0
 ip address 10.0.13.3 255.255.255.0
 ip ospf 1 area 0
!
interface GigabitEthernet2/0
 ip address 10.0.23.3 255.255.255.0
 ip ospf 1 area 0
!         
router ospf 1
 router-id 3.3.3.3
 auto-cost reference-bandwidth 100000
R1
interface GigabitEthernet2/0
 ip address 10.0.13.1 255.255.255.0
 ip ospf 1 area 0
!
interface GigabitEthernet1/0
 ip address 10.0.14.1 255.255.255.0
 ip ospf 1 area 4
!         
router ospf 1
 router-id 1.1.1.1
 auto-cost reference-bandwidth 100000
R2
interface GigabitEthernet2/0
 ip address 10.0.23.2 255.255.255.0
 ip ospf 1 area 0
!
interface FastEthernet1/0
 ip address 10.0.24.2 255.255.255.0
 ip ospf 1 area 4
!
router ospf 1
 router-id 2.2.2.2
 auto-cost reference-bandwidth 100000
R4
interface Loopback0
 ip address 4.4.4.4 255.255.255.255
 ip ospf 1 area 4
!         
interface GigabitEthernet1/0
 ip address 10.0.14.4 255.255.255.0
 ip ospf 1 area 4
!
interface FastEthernet2/0
 ip address 10.0.24.4 255.255.255.0
 ip ospf 1 area 4
!
router ospf 1
 router-id 4.4.4.4
 auto-cost reference-bandwidth 100000
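As a quick sanity check on the costs that appear in the outputs that follow, the usual reference-bandwidth division can be sketched as below. The function name is made up for the example, and the integer division with a floor of 1 is my assumption about rounding; the loopback cost of 1 seen in the outputs is taken as given and not derived here.

```python
REFERENCE_BW_MBPS = 100000  # auto-cost reference-bandwidth 100000

def ospf_cost(interface_bw_mbps):
    """Reference bandwidth divided by interface bandwidth, with a floor of 1."""
    return max(1, REFERENCE_BW_MBPS // interface_bw_mbps)

print(ospf_cost(1000))  # 100  (Gig link)
print(ospf_cost(100))   # 1000 (100Mb link)
```

So a path over one Gig link to a loopback costs 101, and a path over one 100Mb link to a loopback costs 1001.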

Our core should now see that the best way to get to R4’s loopback is to go through R1:

R3#sh ip route 4.4.4.4
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 201, type inter area
  Last update from 10.0.13.1 on GigabitEthernet1/0, 00:09:44 ago
  Routing Descriptor Blocks:
  * 10.0.13.1, from 1.1.1.1, 00:09:44 ago, via GigabitEthernet1/0
      Route metric is 201, traffic share count is 1
R3#traceroute 4.4.4.4
Type escape sequence to abort.
Tracing the route to 4.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.13.1 8 msec 20 msec 16 msec
  2 10.0.14.4 16 msec *  20 msec

Similarly R4 should see that the best way to get to R3 is back through R1:

R4#sh ip route 3.3.3.3
Routing entry for 3.3.3.3/32
  Known via "ospf 1", distance 110, metric 201, type inter area
  Last update from 10.0.14.1 on GigabitEthernet1/0, 00:03:47 ago
  Routing Descriptor Blocks:
  * 10.0.14.1, from 1.1.1.1, 00:03:47 ago, via GigabitEthernet1/0
      Route metric is 201, traffic share count is 1
R4#traceroute 3.3.3.3
Type escape sequence to abort.
Tracing the route to 3.3.3.3
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.14.1 16 msec 16 msec 20 msec
  2 10.0.13.3 20 msec *  20 msec

So everything is fine. Or so we think. There is already a problem here, but it won’t cause a problem until we bring in another customer. Let’s add 2 customers. The first is connected to R1 and the second is connected to R2. Both of these customers have purchased 100Mb single links.

[Figure: multi-area OSPF topology with two additional customers]

So, traffic sent from R4’s loopback to either of the 2 new customers’ loopbacks should get into the core via R4’s 1Gb primary link. Is that what we see?

R4
R4#traceroute 5.5.5.5 source 4.4.4.4
Type escape sequence to abort.
Tracing the route to 5.5.5.5
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.14.1 16 msec 20 msec 20 msec
  2 10.0.15.5 20 msec *  24 msec
R4#
R4#traceroute 6.6.6.6 source 4.4.4.4
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.14.1 12 msec 20 msec 16 msec
  2 10.0.13.3 20 msec 60 msec 20 msec
  3 10.0.23.2 40 msec 44 msec 40 msec
  4 10.0.26.6 72 msec *  44 msec

That’s exactly what we see, but do we have the full picture here? Let’s trace from these new customers to R4’s loopback. Again both should go over R4’s 1Gb primary link:

R5
R5#traceroute 4.4.4.4 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 4.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.15.1 8 msec 20 msec 16 msec
  2 10.0.14.4 20 msec *  24 msec

R5 is correct. What about R6?

R6
R6#traceroute 4.4.4.4 source 6.6.6.6
Type escape sequence to abort.
Tracing the route to 4.4.4.4
VRF info: (vrf in name/id, vrf out name/id)
  1 10.0.26.2 20 msec 16 msec 20 msec
  2 10.0.24.4 64 msec *  68 msec

Well this is most certainly NOT correct. Why is this traceroute going through R4’s 100Mb backup link? Let’s go back to the beginning and see what we missed. Let’s have a look at the 3 core routers to see how they all want to get to 4.4.4.4:

R3
R3#sh ip route 4.4.4.4
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 201, type inter area
  Last update from 10.0.13.1 on GigabitEthernet1/0, 00:29:17 ago
  Routing Descriptor Blocks:
  * 10.0.13.1, from 1.1.1.1, 00:29:17 ago, via GigabitEthernet1/0
      Route metric is 201, traffic share count is 1
R1
R1#sh ip route 4.4.4.4
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 101, type intra area
  Last update from 10.0.14.4 on GigabitEthernet1/0, 00:37:44 ago
  Routing Descriptor Blocks:
  * 10.0.14.4, from 4.4.4.4, 00:37:44 ago, via GigabitEthernet1/0
      Route metric is 101, traffic share count is 1
R2
R2#sh ip route 4.4.4.4
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 1001, type intra area
  Last update from 10.0.24.4 on FastEthernet1/0, 00:38:47 ago
  Routing Descriptor Blocks:
  * 10.0.24.4, from 4.4.4.4, 00:38:47 ago, via FastEthernet1/0
      Route metric is 1001, traffic share count is 1

Here is the problem. R2 prefers to get to 4.4.4.4 over its directly connected link, even though the metric through R3 would be 301 (three Gig links at 100 each, plus 1 for the loopback), a whole lot less than 1001.

The issue is that OSPF has its own selection process. Regardless of metric, OSPF will ALWAYS prefer intra area routes over inter area routes over external routes. R2 has an interface in Area 4, the same area in which it’s learning about R4’s loopback. Hence when traffic addressed to 4.4.4.4 passes through it, it will always send it off over its area 4 interface, no matter how slow it is. It doesn’t make any difference if the second customer is in area 0 or their own area.
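The selection rule can be sketched as a two-part sort key: path type first, metric second. This is a simplified illustration with invented names, not a full OSPF route selection; the inter-area metric assumes three Gig hops at cost 100 each plus 1 for the loopback.

```python
# Lower rank is preferred; metric is only compared within the same path type.
PATH_TYPE_RANK = {"intra-area": 0, "inter-area": 1, "external": 2}

def best_path(candidates):
    """Pick the best (path_type, metric, via) tuple: type first, then metric."""
    return min(candidates, key=lambda c: (PATH_TYPE_RANK[c[0]], c[1]))

# R2's two ways to reach 4.4.4.4: its direct 100Mb area 4 link, or via the core.
r2_candidates = [
    ("intra-area", 1000 + 1, "FastEthernet to R4"),
    ("inter-area", 100 + 100 + 100 + 1, "via R3 and R1"),
]

print(best_path(r2_candidates))  # ('intra-area', 1001, 'FastEthernet to R4')
```

The lower-metric path loses purely because of its type, which is the behaviour R2 shows.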

In fact, if you dive a bit deeper, you can see that as far as R6 is concerned, the traffic will be going over R4’s primary link. If you see the interface cost of R6’s link as well as the cost end to end this is what you get:

R6
R6#sh ip os int brief | include Fa1/0
Fa1/0        1     0               10.0.26.6/24       1000  BDR   1/1
R6#                                  
R6#
R6#sh ip route 4.4.4.4               
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 1301, type inter area
  Last update from 10.0.26.2 on FastEthernet1/0, 00:19:15 ago
  Routing Descriptor Blocks:
  * 10.0.26.2, from 1.1.1.1, 00:19:15 ago, via FastEthernet1/0
      Route metric is 1301, traffic share count is 1

What about R2’s active route cost?

R2
R2#sh ip route 4.4.4.4
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 1001, type intra area
  Last update from 10.0.24.4 on FastEthernet2/0, 00:43:29 ago
  Routing Descriptor Blocks:
  * 10.0.24.4, from 4.4.4.4, 00:43:29 ago, via FastEthernet2/0
      Route metric is 1001, traffic share count is 1

So R6 thinks that traffic will actually go over its 1000-cost link, then over the 3 x 100-cost Gig links. But R2 effectively ‘hijacks’ this traffic and sends it over its direct area 4 link.
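The numbers in the two outputs can be reproduced with some quick arithmetic. This is only a sketch: the extra 1 is assumed to be the OSPF cost of R4’s loopback interface, and R2’s direct link is assumed to have a cost of 1000, since neither is shown explicitly in the output.

```python
LOOPBACK_COST = 1  # assumption: OSPF cost of R4's loopback interface

r6_uplink = 1000              # cost of R6's Fa1/0, from "sh ip os int brief"
core_links = [100, 100, 100]  # the 3 x 100-cost links R6 expects to cross
r2_direct_link = 1000         # assumption: cost of R2's direct area 4 link

# What R6 has in its routing table for 4.4.4.4/32.
r6_expected = r6_uplink + sum(core_links) + LOOPBACK_COST
# What R2 actually uses once the packet arrives.
r2_actual = r2_direct_link + LOOPBACK_COST
# Cost of the path the packet really takes from R6.
actual_end_to_end = r6_uplink + r2_actual

print(r6_expected, r2_actual, actual_end_to_end)  # 1301 1001 2001
```

Under those assumptions, R6 believes it has a 1301-cost path while the packets really follow a path that adds up to 2001.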

So, how can this be fixed?

The first way is to just put everything in area 0. This way all addresses will be reachable via intra-area routes in area 0. Even if you injected all prefixes via redistribution or route-policy, they’d all be external, but still reachable through area 0 links.

The second way is to create some sort of tunnel between R1 and R2 and put that tunnel interface into area 4. This way R2 would learn about R4’s loopback over two area 4 interfaces. You would need to ensure this tunnel interface has a lower cost than the 100Mb direct connection to R4 in order for the tunnel to actually be preferred. But who really wants to be creating tunnels over the core of their network? Virtual-links can only be used to connect to area 0, not area 4. Sham links? They can only be used with MPLS.

The third way is thinking outside the box a little. You could use PPPoE over the secondary link and not use OSPF on the link. On R4 you would have a floating static route pointing towards the dialer interface. The actual RADIUS account you use would create a static route to R4’s loopback with a next-hop of the p2p PPPoE link. Ensure the static route is created with an AD higher than OSPF’s so that the OSPF path is used whenever it is available.

The fourth way is to just use another protocol connecting the core to the CPE device. BGP perhaps?

The fifth and final way, which ties with option 1 for simplicity, is using RFC 5185 – OSPF Multi-Area Adjacency. This RFC describes the ability to put a router’s interface into more than a single OSPF area. This means that I could keep R1 and R2’s links in area 0, but also put those same links into area 4. The same would be done for R3. R2 would then learn the best route from R1 as an intra-area route, without the need for dodgy tunnels. The main problem is that most vendors simply don’t have support for it. Cisco only has it in IOS XE. JUNOS has had it since JUNOS 9.4 though. Brocade? No mention of it anywhere yet.

Considering I have some post-9.4 JUNOS boxes here, let’s test this out:

R2
> show configuration protocols ospf
reference-bandwidth 10g;
area 0.0.0.0 {
    interface fe-0/0/1.66;
    interface fe-0/0/0.51;
}
area 0.0.0.4 {
    interface fe-1/3/0.16 {
        metric 1000;
    }
    interface fe-0/0/1.66 {
        secondary;
    }
}
R3
> show configuration protocols ospf
reference-bandwidth 10g;
area 0.0.0.0 {
    interface fe-0/0/0.66;
    interface fe-0/0/0.63;
    interface lo0.9;
}
area 0.0.0.4 {
    interface fe-0/0/0.66 {
        secondary;
    }
    interface fe-0/0/0.63 {
        secondary;
    }
}
R1
> show configuration protocols ospf
reference-bandwidth 10g;
area 0.0.0.0 {
    interface fe-0/0/1.63;
    interface fe-1/3/3.79;
}
area 0.0.0.4 {
    interface fe-1/3/0.14;
    interface fe-0/0/1.63 {
        secondary;
    }
}

As you can see, the configuration is pretty simple. You simply add an interface to another area and set it as secondary. Let’s have a look at R2’s neighbours:

> show ospf neighbor
Address          Interface              State     ID               Pri  Dead
10.0.26.6        fe-0/0/0.51            Full      6.6.6.6          128    38
  Area 0.0.0.0
10.0.23.3        fe-0/0/1.66            Full      3.3.3.3          128    38
  Area 0.0.0.0
10.0.23.3        fe-0/0/1.66            Full      3.3.3.3          128    38
  Area 0.0.0.4
10.0.24.4        fe-1/3/0.16            Full      4.4.4.4          128    33
  Area 0.0.0.4

R2 has an adjacency over fe-0/0/1.66 twice: one in area 0 and one in area 4. This means it should be learning R4’s loopback as two intra-area routes and one inter-area route. It should then choose the path through R3 as it has the better metric:

> show route 4.4.4.4

inet.0: 17 destinations, 17 routes (17 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

4.4.4.4/32         *[OSPF/10] 00:18:07, metric 300
                    > to 10.0.23.3 via fe-0/0/1.66

Which is exactly what we see.
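The effect of the secondary adjacencies can be sketched as a shortest-path run over area 4 only. This is a toy model rather than a real SPF implementation: the link costs are assumptions derived from the configuration above (reference-bandwidth 10g on 100Mb interfaces gives 100, and the direct R2-R4 link is set to 1000), and `shortest_cost` is just an illustrative helper.

```python
import heapq

def shortest_cost(links, src, dst):
    """Plain Dijkstra over an undirected graph given as {(a, b): cost}."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    dist = {src: 0}
    queue = [(0, src)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    return None  # dst not reachable inside this area

# Area 4 before multi-area adjacency: only the two links facing R4.
area4_before = {("R2", "R4"): 1000, ("R1", "R4"): 100}
# The secondary adjacencies add the R2-R3 and R3-R1 core links to area 4.
area4_after = {**area4_before, ("R2", "R3"): 100, ("R3", "R1"): 100}

before = shortest_cost(area4_before, "R2", "R4")
after = shortest_cost(area4_after, "R2", "R4")
print(before, after)  # 1000 300
```

With the core links visible inside area 4, the intra-area SPF itself finds the 300-cost path, so the intra-area preference no longer works against us.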

Let’s do another traceroute from R6 to confirm:

> traceroute 4.4.4.4
traceroute to 4.4.4.4 (4.4.4.4), 30 hops max, 40 byte packets
 1  10.0.26.2 (10.0.26.2)  1.098 ms  0.965 ms  0.800 ms
 2  10.0.23.3 (10.0.23.3)  0.846 ms  0.943 ms  0.836 ms
 3  10.0.13.1 (10.0.13.1)  0.884 ms  1.036 ms  0.882 ms
 4  4.4.4.4 (4.4.4.4)  1.166 ms  1.328 ms  1.155 ms

Fun with QinQ

Not all service providers have blanket coverage of an entire country. When a service provider gives service to a customer, more than likely the customer will be using a line from an external carrier of the ISP’s choosing.

Metro Ethernet is becoming more and more common these days and it does give us as designers a lot of flexibility. However, each time a customer purchases a leased line, that requires another port in your core. That’s fine if the circuit is a nice gig, but quite often a lot of offices will have anything from 2Mb to 1Gb. Do you really want to waste your core gig ports on a 2Mb circuit? Not really.

A lot of carriers are now offering aggregated ethernet links. Essentially this means that each customer site has a separate port (of course) but these are all aggregated when the carrier hands off to us. We get a single link carrying a bunch of customer circuits. Now you can sell hundreds of 2Mb circuits and only use a single port.
[Figure: QinQ VPLS 1]

But how do we separate traffic then? Well, you’ll come to an agreement with the carrier and each circuit ordered will have a vlan tag on the core side. This means customer 1 site 1’s traffic will arrive on vlan 1000 and customer 2 site 1’s traffic will arrive on vlan 1001. Now you just need to stick each tag into an MPLS solution and all is good.
[Figure: QinQ VPLS 2]

But now what happens when you need to run multiple virtual circuits to a single customer site over their leased line? The vlan tag is already used by the carrier. What if I need to run 2 vrfs for the customer and I need a WAN interface in each vrf?

We could do QinQ, but let’s think about this. In order to do QinQ I need another device at the customer site to pop another vlan tag on. I then need to get that tag into the core, pop off the tag and stick it back into the core. This could get messy.
Another problem with regular Cisco switches is that I can only do dot1q encapsulation on a per-port basis. This means I need to use a port for every customer again, negating the advantage of the aggregated port to begin with. Or I can send traffic to another switch on a per-port basis, then get the second switch to aggregate the vlans back into the core. This will work, but what a nightmare to support.
Option1:
[Figure: QinQ VPLS 3]

Option2:
[Figure: QinQ VPLS 4]

Instead of a regular switch I could use an ME3400G and do selective QinQ. This allows me to specify that vlans 10 and 20 get vlan 1000 popped onto them, and vlans 15 and 25, on the same port, get vlan 1001 popped onto them.

It’s still another device that we could do without.

Let’s tackle the problem at the customer site first. Ideally I would like to do away with a separate device that does my QinQ. I can’t send double-tagged frames directly out my Cisco. Are we sure about this?

interface GigabitEthernet0/1.500
 encapsulation dot1Q 1000 second-dot1q 500
 ip address 172.16.255.1 255.255.255.0
end

The second-dot1q command allows you to both send and receive QinQ traffic directly from a router interface. The first tag is the metro tag, while the second tag is the inner tag. Let’s create another interface in another vrf.
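To picture what ends up on the wire, here is a small Python sketch that packs and unpacks the two stacked tags for the sub-interface above. It assumes both tags use the standard 802.1Q TPID of 0x8100 (some carriers expect a different outer TPID), and the helper names `vlan_tag` and `parse_vlan_ids` are made up for this example.

```python
import struct

TPID_DOT1Q = 0x8100  # assumption: both outer and inner tags use this TPID

def vlan_tag(vlan_id, priority=0, tpid=TPID_DOT1Q):
    """Pack one 4-byte VLAN tag: 16-bit TPID, then PCP(3) | DEI(1) | VID(12)."""
    if not 0 <= vlan_id <= 0x0FFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    tci = (priority << 13) | vlan_id
    return struct.pack("!HH", tpid, tci)

def parse_vlan_ids(tags):
    """Return the VLAN IDs from a run of 4-byte tags, outermost first."""
    ids = []
    for offset in range(0, len(tags), 4):
        _tpid, tci = struct.unpack_from("!HH", tags, offset)
        ids.append(tci & 0x0FFF)
    return ids

# encapsulation dot1Q 1000 second-dot1q 500: metro tag first, inner tag second.
qinq = vlan_tag(1000) + vlan_tag(500)
print(qinq.hex())            # 810003e8810001f4
print(parse_vlan_ids(qinq))  # [1000, 500]
```

These eight bytes sit between the source MAC address and the payload’s EtherType; the carrier only looks at the first tag, which is why the inner one is free for our own use.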

interface GigabitEthernet0/1.600
 encapsulation dot1Q 1000 second-dot1q 600
 ip vrf forwarding 1
 ip address 192.168.1.2 255.255.255.0

Perfect. I can create as many sub-interfaces as I need using a single metro tag.

What about my core? In my core I’m using Brocade Netirons. Let’s see if I can terminate a double-tagged frame:

vpls Test_Customer 500
  vpls-peer 192.168.1.1
   vlan 1000 inner-vlan 500
   tagged ethe 15/1
!
 vpls Test_Customer 600
  vpls-peer 192.168.1.50
   vlan 1000 inner-vlan 600
   tagged ethe 15/1

No problems there. I can create 2 completely separate VPLS instances using the same single metro tag.

So far so good…

Let’s say for whatever reason I now need to run a point-to-point link directly from my core to the CPE at the customer site. I don’t want to terminate this point-to-point link into a VLL/VPLS. This could be for management, to provide voice or internet access, whatever. Unfortunately my Brocade device cannot terminate a double-tagged frame both in the local table as well as in a VPLS. Now we’re close to going back to our earlier switch examples to pop off that pesky outer tag.

But never fear, there is always a way. I did note above that we can’t terminate a double-tagged frame on both. How about sending some traffic as double-tagged and certain traffic as single? Can we do this?

Let’s start again with the CPE and show the full relevant config:

interface GigabitEthernet0/1.1000
 encapsulation dot1Q 1000
 ip address 10.0.0.1 255.255.255.0
!
interface GigabitEthernet0/1.600
 encapsulation dot1Q 1000 second-dot1q 600
 ip vrf forwarding 1
 ip address 192.168.1.2 255.255.255.0
!
interface GigabitEthernet0/1.500
 encapsulation dot1Q 1000 second-dot1q 500
 ip address 172.16.255.1 255.255.255.0

Works just fine on the Cisco. What about my Brocade?

vlan 1000 name Test
 tagged ethe 15/1
 router-interface ve 1000
!
interface ve 1000
 ip address 10.0.0.2/24
!
router mpls
!
vpls Test_Customer 500
  vpls-peer 192.168.1.1
   vlan 1000 inner-vlan 500
   tagged ethe 15/1
!
 vpls Test_Customer 600
  vpls-peer 192.168.1.50
   vlan 1000 inner-vlan 600
   tagged ethe 15/1

Everything tested and everything works perfectly. We’ve managed to remove all kinds of kit and also managed to simplify the solution.
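Conceptually, the core port is now sorting frames on the (outer tag, inner tag) pair. The Python sketch below models that lookup using the values from the config above. It is only an illustration of the idea, not a description of how the NetIron does it internally, and returning `None` for an unmatched frame is an assumption of the model.

```python
# Keys are (outer vlan, inner vlan); None means the frame is single-tagged.
SERVICES = {
    (1000, None): "ve 1000",
    (1000, 500): "vpls Test_Customer 500",
    (1000, 600): "vpls Test_Customer 600",
}

def classify(outer_vlan, inner_vlan=None):
    """Map a frame's tag stack to the service that terminates it, if any."""
    return SERVICES.get((outer_vlan, inner_vlan))

print(classify(1000))       # ve 1000 (routed point-to-point link)
print(classify(1000, 500))  # vpls Test_Customer 500
print(classify(1000, 600))  # vpls Test_Customer 600
print(classify(1000, 700))  # None, nothing configured for this pair
```

The single-tagged and double-tagged entries never collide because the inner tag is part of the key, which is what lets one metro tag carry both the routed link and the VPLS instances.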