The dangers of ignoring OSPF MTU

On February 6, 2013, in CCIE, Fundamentals, JNCIE, by Darren

Quite often I see ip ospf mtu-ignore configured when two routers have mismatched MTUs. This is bad. To demonstrate why, I’ll use the following simple topology:
[Image: OSPF MTU topology]

Let’s create a simple area 0 point-to-point adjacency between the two routers and make R1’s MTU slightly larger. We then have to ignore the OSPF MTU check, otherwise the adjacency will not come up:

R1
interface GigabitEthernet1/0
 mtu 2000
 ip address 10.0.0.1 255.255.255.0
 ip ospf network point-to-point
 ip ospf mtu-ignore
 ip ospf 1 area 0

The adjacency is fine as far as we can see:

R1#sh ip ospf neighbor | beg Nei
Neighbor ID     Pri   State           Dead Time   Address         Interface
1.2.3.255         0   FULL/  -        00:00:30    10.0.0.1        GigabitEthernet1/0

Now I’ve added 255 loopback interfaces onto R1 and put them all into OSPF by using network 0.0.0.0 0.0.0.0 area 0. This means all those loopback interfaces will be part of the type1 LSA originated by R1. What happens though?

interface Loopback1
 ip address 1.2.3.1 255.255.255.255
!
interface Loopback2
 ip address 1.2.3.2 255.255.255.255
!
interface Loopback3
!
[etc etc etc]
!
router ospf 1
 network 0.0.0.0 0.0.0.0 area 0

At first, nothing seems wrong. But take a look at the database from R1’s and R2’s perspective. Remember the databases should be identical.

R1#sh ip ospf database

            OSPF Router with ID (1.2.3.255) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.2.3.255       1.2.3.255       33          0x80000005 0x00767B 257
10.0.0.2        10.0.0.2        100         0x80000011 0x00C816 2
R2#sh ip ospf database

            OSPF Router with ID (10.0.0.2) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.2.3.255       1.2.3.255       130         0x80000004 0x00856D 2
10.0.0.2        10.0.0.2        128         0x80000011 0x00C816 2

R1 sees a link count of 257 for R1’s router LSA, while R2 only sees 2. This can be confirmed by seeing that R2 doesn’t have any OSPF routes to R1’s loopbacks:

R2#sh ip route ospf | beg Gate
Gateway of last resort is not set

If you wait a while you’ll see LOADING on the adjacency too. And eventually the adjacency resets and tries again:

R2#sh ip ospf neighbor

Neighbor ID     Pri   State           Dead Time   Address         Interface
1.2.3.255         0   LOADING/  -     00:00:32    10.0.0.1        GigabitEthernet1/0
R2#
*Feb  6 19:11:26.958: %OSPF-5-ADJCHG: Process 1, Nbr 1.2.3.255 on GigabitEthernet1/0 
from LOADING to DOWN, Neighbor Down: Too many retransmissions

So what exactly is happening? If you check Wireshark you’ll see the issue straight away:
[Images: Wireshark captures ospfmtu1, ospfmtu2]
OSPF does not do any sort of path MTU discovery. R1 is attempting to send a type1 LSA and it’s using an MTU size of 2000. R2 cannot receive a frame that large, so those fragments get dropped. R2 never acknowledges the LSA as it never receives it, and eventually R1 hits its retransmission limit and the adjacency resets. This then repeats over and over.

This could be hidden though. Let’s stop R1 advertising all those addresses via its type1 LSA and instead redistribute the links into OSPF:

R1(config)#router ospf 1
R1(config-router)#no network 0.0.0.0 0.0.0.0 area 0
R1(config-router)#redistribute connected subnet
R1(config-router)#end

R2#sh ip ospf neighbor

Neighbor ID     Pri   State           Dead Time   Address         Interface
1.2.3.255         0   FULL/  -        00:00:38    10.0.0.1        GigabitEthernet1/0
R2#sh ip route ospf | beg Gate
Gateway of last resort is not set

      1.0.0.0/32 is subnetted, 255 subnets
O E2     1.2.3.1 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.2 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.3 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.4 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.5 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.6 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.7 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.8 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
O E2     1.2.3.9 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
[etc]

This time it works, even with a mismatched MTU. Why? A type 5 LSA only has space for a single prefix. This means that R1 originates 255 type 5 LSAs, each of which is much, much smaller than 2000 bytes. The LS updates are therefore not bigger than 1500 bytes, so R2 never drops any of those packets.

A router only originates a single router LSA per area, and that single LSA has to contain every interface on that router that is enabled for OSPF in the area. If a router has 1000 interfaces, well, that’s a large type1.
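
To put rough numbers on this, here is a minimal Python sketch of the size arithmetic. The field sizes are the OSPFv2 ones from RFC 2328 (20-byte LSA header, 12 bytes per router-LSA link with no TOS entries, 36 bytes for a type 5 LSA). The function names are made up for illustration, and it assumes a plain 20-byte IP header and one LSA per LS Update.

```python
# Sizes in bytes, following the OSPFv2 packet formats in RFC 2328.
IP_HEADER = 20          # assumption: no IP options
OSPF_HEADER = 24
LS_UPDATE_FIXED = 4     # the "number of LSAs" field in an LS Update
LSA_HEADER = 20
ROUTER_LSA_FIXED = 4    # flags + link count
ROUTER_LSA_LINK = 12    # per link, assuming no extra TOS metrics
EXTERNAL_LSA_BODY = 16  # mask, metric, forwarding address, route tag


def router_lsa_size(link_count: int) -> int:
    """Size of a type 1 LSA carrying link_count links."""
    return LSA_HEADER + ROUTER_LSA_FIXED + link_count * ROUTER_LSA_LINK


def external_lsa_size() -> int:
    """Size of a type 5 LSA, which describes a single prefix."""
    return LSA_HEADER + EXTERNAL_LSA_BODY


def ls_update_ip_size(lsa_size: int) -> int:
    """IP packet size of an LS Update carrying one LSA of lsa_size bytes."""
    return IP_HEADER + OSPF_HEADER + LS_UPDATE_FIXED + lsa_size


def max_links_without_fragmenting(ip_mtu: int) -> int:
    """Largest link count whose router LSA still fits in one IP packet."""
    overhead = ls_update_ip_size(0) + LSA_HEADER + ROUTER_LSA_FIXED
    return (ip_mtu - overhead) // ROUTER_LSA_LINK


if __name__ == "__main__":
    print(ls_update_ip_size(router_lsa_size(257)))   # R1's router LSA in the lab
    print(ls_update_ip_size(external_lsa_size()))    # one type 5 LSA
    print(max_links_without_fragmenting(1500))
```

By this arithmetic the 257-link router LSA needs a 3156-byte IP packet, which is larger than even R1’s 2000-byte MTU, while a single type 5 update is only 84 bytes. A router LSA stops fitting into a 1500-byte packet somewhere around 120 links.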

You can see the individual type5s in the database itself:

R2#sh ip ospf database | beg External
                Type-5 AS External Link States

Link ID         ADV Router      Age         Seq#       Checksum Tag
1.2.3.1         1.2.3.255       390         0x80000001 0x00692A 0
1.2.3.2         1.2.3.255       390         0x80000001 0x005F33 0
1.2.3.3         1.2.3.255       390         0x80000001 0x00553C 0
1.2.3.4         1.2.3.255       390         0x80000001 0x004B45 0
1.2.3.5         1.2.3.255       390         0x80000001 0x00414E 0
1.2.3.6         1.2.3.255       390         0x80000001 0x003757 0
1.2.3.7         1.2.3.255       390         0x80000001 0x002D60 0
1.2.3.8         1.2.3.255       390         0x80000001 0x002369 0
1.2.3.9         1.2.3.255       390         0x80000001 0x001972 0
1.2.3.10        1.2.3.255       390         0x80000001 0x000F7B 0
[etc etc]

Out of interest, type3, type4, type5, and type7 LSAs all follow the ‘single address per LSA’ model and as such should never be that big. A type2 LSA will expand to reflect the number of routers on the layer 2 segment, but I find it hard to believe that there would be over 100 routers on a single segment (though it’s not impossible).

By the way, I wrote a separate post explaining a few more in-depth SPF considerations when it comes to type1s and type5s over here: OSPF – Type 1 LSA vs Type 5 LSA (passive vs redistribute)

So there you have it. Ignore the MTU at your own peril. Fix the MTU mismatch rather than just ignoring it. It might not be an issue ‘now’, but as your router LSA grows in size you will suddenly run into a problem.

PIM Assert. Is it really just metric?

On January 24, 2013, in CCIE, by Darren

EDIT: 24/01/13 – Thanks to IOS ping being funny, some of what is below is a bit inaccurate. Fear not, there will be an update! However not yet. Next week…

There seems to be a common myth that PIM assert messages are based on route metric alone. This is not always true.
Let’s use the following diagram as a basis for this post. R1 is the source and R4 is the receiver. R4 is attached to the same segment as R2 and R3.
[Image: PIM assert 1]
It’s inefficient to have both R2 and R3 forward multicast traffic onto this same segment. When a router receives a multicast packet on an interface that is in its OIL for that group, it knows another router on that segment is sending the same traffic. At this point both will send PIM assert messages to decide which router should be sending packets onto this segment.
[Image: PIM assert 2]
Let’s see this in practice. For now all routers are running OSPF with default metrics. I will ping the 239.1.1.1 group from R1 with a source of 1.1.1.1 – These packets will go to both R2 and R3 and both will initially send frames off to R4. At this point both will realise that the other router is also sending and so they will need to assert themselves.

R1#ping 239.1.1.1 so lo0 rep 5

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 239.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
.
Reply to request 1 from 10.0.234.4, 88 ms
Reply to request 2 from 10.0.234.4, 84 ms
Reply to request 3 from 10.0.234.4, 84 ms
Reply to request 4 from 10.0.234.4, 84 ms

Let’s take a look at what debug ip pim shows us on R2:

*Jan 24 08:22:17.631: PIM(0): Received v2 Assert on FastEthernet1/1 from 10.0.234.3
*Jan 24 08:22:17.635: PIM(0): Assert metric to source 10.0.13.1 is [0/0]
*Jan 24 08:22:17.639: PIM(0): We lose, our metric [110/2]
*Jan 24 08:22:17.639: PIM(0): Prune FastEthernet1/1/239.1.1.1 from (10.0.13.1/32, 239.1.1.1)

R2 receives an assert message from R3 saying its metric is better. R2 therefore prunes fa1/1 off the OIL for this group. But look a bit deeper. R2 says its metric is 110/2, which we can confirm:

R2#sh ip route 1.1.1.1
Routing entry for 1.1.1.1/32
  Known via "ospf 1", distance 110, metric 2, type intra area
  Last update from 10.0.12.1 on FastEthernet1/0, 00:17:48 ago
  Routing Descriptor Blocks:
  * 10.0.12.1, from 1.1.1.1, 00:17:48 ago, via FastEthernet1/0
      Route metric is 2, traffic share count is 1

R2’s route back to the source has an AD of 110 and a metric of 2. So already we know that AD also comes into play here. But why does R3 say its metric is 0/0?

*Jan 24 08:22:17.579: PIM(0): Send v2 Assert on FastEthernet1/1 for 239.1.1.1,
 source 10.0.13.1, metric [0/0]
*Jan 24 08:22:17.583: PIM(0): Assert metric to source 10.0.13.1 is [0/0]

You can see in the previous debug on R2 that it does receive those values from R3. But R3’s actual route looks like this:

R3#sh ip route 1.1.1.1
Routing entry for 1.1.1.1/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 10.0.234.2 on FastEthernet1/1, 00:19:36 ago
  Routing Descriptor Blocks:
  * 10.0.234.2, from 1.1.1.1, 00:19:36 ago, via FastEthernet1/1
      Route metric is 3, traffic share count is 1

Even Wireshark confirms that R3’s values are 0/0:
[Image: PIM assert 3, Wireshark capture]
What happens if I up R3’s link cost to R1?

R3(config)#int fa1/0
R3(config-if)#ip ospf cost 5

Well, the outputs are interesting in that they match the above output exactly. R3 is still sending its metric as 0/0, even though because of the cost change its best route is actually through R2. That’s messed up, surely.

The only ‘special’ thing about R3 at the moment is that it is the DR for the segment. I lowered the DR priority of R4 to zero and hence R3 became the DR as it had the highest IP address on the segment.

This is where things start to get interesting now. If metric were the only thing to be checked here, what would happen if R2 and R3 had routes back to 1.1.1.1 via different protocols? Protocol metrics are not comparable which is why you need seed metrics (or use built-in seed metrics) when redistributing.

Let’s create a static route to 1.1.1.1 on R2:

R2(config)#ip route 1.1.1.1 255.255.255.255 10.0.12.1
R2(config)#end
R2#
*Jan 24 08:46:08.355: %SYS-5-CONFIG_I: Configured from console by console
R2#sh ip route 1.1.1.1
Routing entry for 1.1.1.1/32
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 10.0.12.1
      Route metric is 0, traffic share count is 1

The AD is 1, the metric is zero.

Who will now forward multicast traffic for the segment? If R3 continues to assert itself as 0/0 it’ll win, if not then R2 will win.

Well, no use posting the output as it’s exactly the same. R3 is still asserting itself as 0/0 and hence is ‘winning’ again.

*Jan 24 08:48:20.163: PIM(0): Received v2 Assert on FastEthernet1/1 from 10.0.234.3
*Jan 24 08:48:20.167: PIM(0): Assert metric to source 10.0.13.1 is [0/0]

This all looks to me like the DR is forcing its hand here. You can’t beat a 0/0 AD-metric pair. Maybe the IOS implementation says it must be this way? What does the actual RFC (RFC 4601) say?

I am Assert Winner (W)
This router has won an (S,G) assert on interface I. It is now
responsible for forwarding traffic from S destined for G out of
interface I. Irrespective of whether it is the DR for I, while a
router is the assert winner, it is also responsible for forwarding
traffic onto I on behalf of local hosts on I that have made
membership requests that specifically refer to S (and G).

I am Assert Loser (L)
This router has lost an (S,G) assert on interface I. It must not
forward packets from S destined for G onto interface I. If it is
the DR on I, it is no longer responsible for forwarding traffic
onto I to satisfy local hosts with membership requests that
specifically refer to S and G.

Assert metrics are defined as:

struct assert_metric {
  rpt_bit_flag;
  metric_preference;
  route_metric;
  ip_address;
};
When comparing assert_metrics, the rpt_bit_flag, metric_preference,
and route_metric field are compared in order, where the first lower
value wins. If all fields are equal, the primary IP address of the
router that sourced the Assert message is used as a tie-breaker, with
the highest IP address winning.
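
As a rough illustration of that comparison rule, here is a small Python sketch. The class and function names are my own and not from the RFC or IOS; metric_preference is what IOS shows as the administrative distance, and the values below are the ones seen in the debugs in this post.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class AssertMetric:
    rpt_bit_flag: int
    metric_preference: int   # administrative distance on IOS
    route_metric: int
    ip_address: IPv4Address  # address of the router that sent the assert


def assert_winner(a: AssertMetric, b: AssertMetric) -> AssertMetric:
    """Return the winning assert: fields compared in order, lower wins,
    and the highest IP address breaks a complete tie."""
    key_a = (a.rpt_bit_flag, a.metric_preference, a.route_metric)
    key_b = (b.rpt_bit_flag, b.metric_preference, b.route_metric)
    if key_a != key_b:
        return a if key_a < key_b else b
    return a if a.ip_address > b.ip_address else b


if __name__ == "__main__":
    r2_ospf = AssertMetric(0, 110, 2, IPv4Address("10.0.234.2"))
    r2_static = AssertMetric(0, 1, 0, IPv4Address("10.0.234.2"))
    r3 = AssertMetric(0, 0, 0, IPv4Address("10.0.234.3"))
    # R3 asserts [0/0], so it beats R2 whether R2 uses OSPF or a static route.
    print(assert_winner(r2_ospf, r3).ip_address)
    print(assert_winner(r2_static, r3).ip_address)
```

Because the preference is compared before the metric, a static route on R2 (AD 1) would beat any OSPF-learned route on R3, but nothing beats a router that asserts 0/0.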

There is this tidbit though:

2. Behavior: The assert winner for (*,G) acts as the local DR for
(*,G) on behalf of IGMP/MLD members.

Assert messages elect the Designated Forwarder for the segment. The Designated Router is a different election. The DR is responsible for sending joins upstream on behalf of local IGMP members, while the DF is responsible for forwarding traffic onto the segment. To me, this single line in the RFC states that these roles are somehow tied together. Maybe this is why R3 continues to force itself as the DF? It looks like it.

It would be interesting to try the above test in Junos to see what its implementation does. I won’t have access to my Junos lab for a few days though, so it will have to be another time. I was interested to know because different vendors use different AD values for protocols, and hence a router could in theory ‘beat’ another vendor’s router even if they have the same protocol metric but different AD values.

If Junos also uses 0/0 when it is a DR, then that would explain a few things as well.

Let’s make R2 the DR and see what happens:

R2(config)#int fa1/1
R2(config-if)#ip pim dr-priority 1000

Oddly now I don’t see any prune messages. Debugging on R1 shows that R3 is immediately sending a prune to R1 and hence R3 doesn’t even begin to forward traffic out on the shared segment. That means no assert is needed.

Hopefully this helps a few people out there. Sometimes things are a bit more tricky than they appear…

ip pim autorp listener should be named ip pim autorp forwarder

There is a fair amount of confusion about what exactly ip pim autorp listener actually does, or on which router to configure it. This probably stems from the fact that it has the word ‘listener’ in the command when it really doesn’t configure the router to listen for anything. Let’s take the following diagram into consideration for this post:
[Image: autorplistener topology]
I’ve configured OSPF on all interfaces, including loopbacks, and configured ip pim sparse-mode on all interfaces.

Auto-RP uses the multicast addresses 224.0.1.39 and 224.0.1.40.
224.0.1.39 is the AUTO-RP-ANNOUNCE group. The mapping agent (send-rp-discovery) router uses it to listen for routers sending candidate announcements, i.e. candidate RPs send multicast traffic TO 224.0.1.39 and the mapping agent joins this group to RECEIVE that traffic.
224.0.1.40 is the AUTO-RP-DISCOVERY group. The mapping agent sends to it so that other multicast routers can learn which address is the rendezvous point, i.e. the mapping agent sends traffic to this group, and all multicast routers join this group so that they can receive this traffic.

The problem is that routers are running sparse-mode. How can they join a group if they don’t know the location of the RP? A classic chicken and egg scenario.

Let’s start off with a basic config. We’ll check the mroute table along the way, and use Wireshark to check which auto-rp packets are being sent by which routers. On R1 I’ve enabled ip pim on its interfaces and enabled ip multicast-routing. Nothing else. Let’s take a look at the mroute table:

R1#sh ip mroute | beg \(
(*, 224.0.1.40), 00:03:25/00:02:25, RP 0.0.0.0, flags: DPL
  Incoming interface: Null, RPF nbr 0.0.0.0
  Outgoing interface list: Null

This router is already listening to the 224.0.1.40 group, the group that carries messages from the mapping agent to let it know where the RP is. So the router is already ‘listening’. The flags are DPL, meaning Dense/Pruned/Local. Checking Wireshark, there is no traffic destined to 224.0.1.39 or .40, as expected.

On R1, let’s configure the router so it considers itself a candidate RP:

R1(config)#ip pim send-rp-announce lo0 scope 10 interval 5

I’m also going to debug any auto-rp messages via the other routers:

R2#debug ip pim auto-rp
PIM Auto-RP debugging is on

Wireshark shows me that R1 is sending an auto-rp packet every 5 seconds. Source is 1.1.1.1 and destination is 224.0.1.39 which is expected. 224.0.1.39 is the group that the mapping agent will join to receive these announcements.
[Image: autorplistener 1, Wireshark capture]
So Wireshark is showing that R2 receives these packets. Wireshark also shows that R2 does not forward these packets in a dense-mode fashion (I’m not going to paste an empty Wireshark capture).

Now I’m going to enable ip pim autorp listener on R2. This will cause R2 to flood the two autorp groups in a dense mode fashion. This means R3 will receive those frames, but not forward them onto R4:

R2(config)#ip pim autorp listener

Straight away on gi2/0 of R2 I see the following packet getting sent:
[Image: autorplistener 2, Wireshark capture]
This now means that R3 is receiving packets destined to 224.0.1.39, but R4 still isn’t. But let’s use this to our advantage. Let’s make R3 the mapping agent. This will cause it to consider R1 to be the RP and it will flood those frames out its directly connected interfaces.

R3(config)#ip pim send-rp-discovery lo0 scope 10 interval 5

R3 already received the packets destined to 224.0.1.39 via R2. This allows it to make an informed decision, and it maps R1 to be the RP for 224.0.0.0/4. It then sends messages out to 224.0.1.40, which all routers are already listening for.
[Image: autorplistener 3]

My debug auto-rp shows the following on R1, R2, and R4:

*Jan 14 13:27:29.531: Auto-RP(0): Received RP-discovery packet of length 48, from 3.3.3.3, RP_cnt 1, ht 16
*Jan 14 13:27:29.531: Auto-RP(0): Update (224.0.0.0/4, RP:1.1.1.1), PIMv2 v1

Even R4 knows about the RP and can now join groups:

R4#sh ip pim rp map
PIM Group-to-RP Mappings

Group(s) 224.0.0.0/4
  RP 1.1.1.1 (?), v2v1
    Info source: 3.3.3.3 (?), elected via Auto-RP
         Uptime: 00:04:16, expires: 00:00:15

However R5 has no mappings whatsoever. This is because while R4 is receiving packets destined to 224.0.1.40, it’s not forwarding them. So let’s configure autorp listener on R4:

R4(config)#ip pim autorp listener

Straight away on R5:

*Jan 14 13:32:18.511: Auto-RP(0): Received RP-discovery packet of length 48, from 3.3.3.3, RP_cnt 1, ht 16
*Jan 14 13:32:18.511: Auto-RP(0): Added with (224.0.0.0/4, RP:1.1.1.1), PIMv2 v1

Wireshark confirms that autorp packets are now being sent out R4’s gi2/0 interface.

So here I have five routers running auto-rp in sparse mode. I’ve only configured autorp listener on two routers: R2 and R4.

Conclusions

  • Autorp listener doesn’t cause the router to listen for anything, rather it causes the router to dense flood groups 224.0.1.39 and 224.0.1.40 when it receives them
  • This means that routers at the end of the path, or certain routers running as the candidate RP or MA, do not actually need to be configured with autorp listener
  • I say certain routers because if your topology has multiple candidate RPs or MAs, then each would need to ensure that dense flood packets from OTHER RPs or MAs are forwarded
  • Think of autorp listener as an autorp forwarder command, as that’s essentially what the command does
  • Use BSR if you can. It’s an open protocol (unlike autorp, even though Junos DOES support autorp) and RPs and MAs are learnt on a hop by hop basis through PIMv2
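
The forwarding behaviour seen in this lab can be modelled with a toy Python sketch. It assumes the five routers sit in a straight line R1 to R5 (my reading of the outputs above, since the diagram is not reproduced here), and the function name is made up. It is not how IOS implements anything, just the rule ‘a directly attached router always receives, but only an autorp listener router floods the packet onwards’.

```python
from typing import Sequence, Set

CHAIN = ("R1", "R2", "R3", "R4", "R5")  # assumed linear topology


def receivers(origin: str, listeners: Set[str],
              chain: Sequence[str] = CHAIN) -> Set[str]:
    """Routers that receive an Auto-RP packet originated by `origin`."""
    reached = set()
    start = chain.index(origin)
    for step in (-1, 1):               # walk left, then right, from the origin
        i = start + step
        while 0 <= i < len(chain):
            reached.add(chain[i])      # the next router on the link receives it
            if chain[i] not in listeners:
                break                  # no autorp listener: it is not flooded on
            i += step
    return reached


if __name__ == "__main__":
    # R1's RP announce with listener on R2 only
    print(sorted(receivers("R1", {"R2"})))
    # R3's RP discovery before and after adding listener on R4
    print(sorted(receivers("R3", {"R2"})))
    print(sorted(receivers("R3", {"R2", "R4"})))
```

With listener only on R2, the announce from R1 reaches R3 and the discovery from R3 reaches R1, R2 and R4, which matches the debugs above. R5 only appears once R4 is added to the listener set.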

Private Vlans – Control Plane/Data Plane

On December 9, 2012, in CCIE, by Darren

Private vlans are very easy to configure. But what’s actually happening at the data plane level? Knowing this helps when we need to span our pvlans through switches that do not support private vlans.

Let’s use the following topology for these tests. The routers are just acting as hosts. I also have a laptop connected to port fa0/48 on SW3 to capture some frames. All routers are in the 10.0.0.x/24 range.

[Image: pvlan topology]

SW1 is a 3750 and SW2 is a 3560. Both support pvlans. SW3 is a 3550 which has no concept of private vlans.

The first thing I’m going to do is create the primary private vlan with an id of 50. I’ll then create a secondary community vlan with an id of 24. I’ll then put R2’s and R4’s ports into the pvlan.

SW1:

vlan 24
  private-vlan community
!
vlan 50
  private-vlan primary
  private-vlan association 24
!
interface FastEthernet1/0/4
 switchport private-vlan host-association 50 24
 switchport mode private-vlan host

At this point if I ping R4 from R2 the ping fails:

R2#ping 10.0.0.4

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.4, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

If two community ports are trying to speak to each other, SW1 and SW2 will use the community vlan id for traffic in both directions. The community vlan is 24, so let’s try creating a regular vlan 24 on SW3:

SW3(config)#vlan 24
SW3(config-vlan)#end

Can we ping now?

R2#ping 10.0.0.4

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.4, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/6/20 ms

It doesn’t matter that SW3 doesn’t support pvlans at this point. SW1 and SW2 simply use a generic vlan tag of 24. This can be seen in the Wireshark captures of both the ping request and the response:
[Images: pvlan 1 and pvlan 2, Wireshark captures]
SW3 doesn’t know this is a private-vlan. It simply sees traffic for vlan 24 passing through its trunk ports. Does this mean that we can put R3’s port into vlan 24 and it’ll be able to ping both R2 and R4? Well it should, but let’s find out.
SW3:

interface FastEthernet0/1
 switchport access vlan 24
 switchport mode access

R2:

R2#ping 10.0.0.3

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.3, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/7/16 ms

No problems there. Wireshark shows the same regular vlan 24 being used between the switches.

Now let’s add a promiscuous port. I’ll make port fa1/0/1 on SW1 a promiscuous port which will represent the gateway on this subnet.
SW1:

interface FastEthernet1/0/1
 switchport private-vlan mapping 50 24
 switchport mode private-vlan promiscuous

Can R4 and R2 ping R1?

R4#ping 10.0.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/6/16 ms
R2#ping 10.0.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

R4 can ping, R2 can’t. Why it can’t will become clear very shortly. Let’s add vlan 50 on SW3 and try again on R2.

SW3#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
SW3(config)#vlan 50
SW3(config-vlan)#end
R2#ping 10.0.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/8/20 ms

Now R2 can ping. To figure out why, let’s have a look at what Wireshark is telling us.
[Images: pvlan 3 and pvlan 4, Wireshark captures]
It’s quite clear from the captures above. When traffic needs to get from a community port to a promiscuous port, SW1 and SW2 will use vlan 24 unidirectionally to get to the promiscuous port. Traffic going from the promiscuous port back to the community ports will use the primary vlan, which is id 50. (Isolated ports are the same in this regard.)
As soon as vlan 50 is allowed through SW3, R2 can speak to the segment gateway. Once again SW3 has no idea that vlan 50 is any sort of private vlan.
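
The tagging rule can be written down as a tiny Python sketch. The function names are invented for illustration, and it only models what the captures show: which dot1q tag a frame carries on the trunk depending on the type of port it came from, and whether a host in a plain access vlan on the transport switch would ever see it.

```python
from typing import Optional


def trunk_tag(port_type: str, primary: int,
              secondary: Optional[int] = None) -> int:
    """dot1q vlan id used on the trunk for a frame sourced from this port type."""
    if port_type == "promiscuous":
        return primary                 # promiscuous ports send in the primary vlan
    if port_type in ("community", "isolated"):
        if secondary is None:
            raise ValueError("host ports need a secondary vlan")
        return secondary               # host ports send in their secondary vlan
    raise ValueError(f"unknown port type: {port_type}")


def access_host_receives(access_vlan: int, frame_tag: int) -> bool:
    """A host on a non-pvlan switch only sees frames tagged with its access vlan."""
    return access_vlan == frame_tag


if __name__ == "__main__":
    to_gateway = trunk_tag("community", primary=50, secondary=24)
    from_gateway = trunk_tag("promiscuous", primary=50, secondary=24)
    print(to_gateway, from_gateway)
    # A host sitting in access vlan 24 on the transport switch
    print(access_host_receives(24, to_gateway), access_host_receives(24, from_gateway))
```

This is why both vlan 24 and vlan 50 have to exist on SW3 before R2 can reach the gateway: the request and the reply travel in different vlans.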

However, this does cause a problem if R3 tries to get to the default gateway. R3 belongs to vlan 24 only, and so will never receive any reply from R1:

R3#ping 10.0.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

R3 is sending out an ARP request which R1 does receive. This can be proved with a debug arp on R1:

R1#debug arp
ARP packet debugging is on
R1#
*Dec  9 13:04:55.339: IP ARP: rcvd req src 10.0.0.3 ca02.0b28.0008, dst 10.0.0.1 FastEthernet0/0
*Dec  9 13:04:55.339: IP ARP: sent rep src 10.0.0.1 ca00.0b28.0008,
                 dst 10.0.0.3 ca02.0b28.0008 FastEthernet0/0

The problem is that the reply is getting sent over vlan 50, which R3 is not part of. What happens if we make R3 part of vlan 50 on SW3?

SW3(config)#int fa0/1
SW3(config-if)#swit acc vlan 50
SW3(config-if)#end
R3#ping 10.0.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/7/16 ms

That worked. Surely SW1 would assume that incoming traffic on vlan 50 can’t be correct, as traffic from the associated ports should come in on the secondary vlan? Well, there is a reason this works, which I’ll cover in a bit. Wireshark proves that both the request and the replies were sent using vlan 50. Of course R3 no longer has access to R2 or R4:

R3#ping 10.0.0.2

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.2, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
R3#ping 10.0.0.4

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.4, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

What if we wanted to create an SVI on all 3 of our switches so that each had a layer3 interface in the vlan? Let’s try to create int vlan 24 on SW1:

SW1(config)#int vlan 24
SW1(config-if)#
*Mar  1 02:16:48.839: %PV-6-PV_SVI_DOWN: Vlan 24's interface remains down because this vlan is a secondary vlan.
*Mar  1 02:16:48.848: %LINEPROTO-5-UPDOWN: Line protocol on Interface Vlan24, changed state to down

This doesn’t work. SVIs will only work when you use the primary vlan id:

interface Vlan50
 ip address 10.0.0.11 255.255.255.0

This SVI is not part of the community though. This means it should be able to ping R1 and R3, but not R2 or R4:

SW1#ping 10.0.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/208/1015 ms
SW1#ping 10.0.0.2

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.2, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
SW1#ping 10.0.0.3

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.3, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/206/1007 ms
SW1#ping 10.0.0.4

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.4, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

Which is what we see. In fact if I ping from SW2 to R1, the vlans in both directions should be vlan 50. Are they?
[Images: pvlan 5 and pvlan 6, Wireshark captures]
Yes they are. This is why the ping from R3 worked earlier. When the frame entered SW1’s trunk interface tagged with vlan 50, SW1 had no idea it wasn’t coming from a pvlan switch. It simply sees a source of vlan 50, which is valid. Yes, frames from community and isolated ports will come inbound with their respective secondary vlans, but a frame coming in tagged with the primary vlan is still a valid frame.

What happens if I don’t want to use R1 as the gateway, and would rather use SW2’s SVI interface? At the moment none of the community ports can actually ping the SVIs as they are not promiscuous. We can’t specifically make them promiscuous, but we can map secondary vlans to an SVI interface:
SW2:

interface Vlan50
 ip address 10.0.0.12 255.255.255.0
 private-vlan mapping 24
R2#ping 10.0.0.12

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.12, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/3/4 ms
R4#ping 10.0.0.12

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.12, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/7/20 ms

For now, this means that SW3’s SVI should be able to ping SW1, SW2, and R1, but neither R2 nor R4:

SW3#ping 10.0.0.11

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.11, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/4/8 ms
SW3#ping 10.0.0.12

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.12, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/4/4 ms
SW3#ping 10.0.0.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/202/1000 ms
SW3#ping 10.0.0.2

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.2, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
SW3#ping 10.0.0.4

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.4, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

There are a couple of interesting things to note. If I remove the associations in my vlan config, then any interface using pvlans associated with the wrongly configured pvlan will go down:

SW1#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
SW1(config)#vlan 50
SW1(config-vlan)#no private-vlan association 24
SW1(config-vlan)#end
SW1#
*Mar  1 03:07:08.604: %PV-6-PV_MSG: Purged a private vlan mapping, Primary 50, Secondary 24
*Mar  1 03:07:08.621: %SYS-5-CONFIG_I: Configured from console by console
*Mar  1 03:07:09.619: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet1/0/1, changed state to down
*Mar  1 03:07:09.619: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet1/0/4, changed state to down

This can be a bit tricky to troubleshoot, as there is no reason given for the port being down:

SW1#sh int fa1/0/1
FastEthernet1/0/1 is up, line protocol is down (notconnect)

The only way to figure that out (at least that I can find) is to have a look at the vlans configured, or check the switchport command:

SW1#sh int fa1/0/1 switchport | include Administrative|Operational
Administrative Mode: private-vlan promiscuous
Operational Mode: down
Administrative Trunking Encapsulation: negotiate
Administrative Native VLAN tagging: enabled
Administrative private-vlan host-association: none
Administrative private-vlan mapping: 50 (VLAN0050) 24 (VLAN0024)
Administrative private-vlan trunk native VLAN: none
Administrative private-vlan trunk Native VLAN tagging: enabled
Administrative private-vlan trunk encapsulation: dot1q
Administrative private-vlan trunk normal VLANs: none
Administrative private-vlan trunk associations: none
Administrative private-vlan trunk mappings: none
Operational private-vlan: none

The administrative mode shows private-vlan promiscuous, but the operational mode shows down and the operational private-vlan shows none. Not that easy to find.

Conclusions

  • It’s not essential to have a switch that supports pvlan as a transport switch between 2 pvlan switches
  • While you could add a host to the transport switch, it’s either going to be able to speak to the hosts, or the gateway. Not both at the same time
  • Just keep hosts or the routing SVI off the transport switch. Keep these limited to the pvlan switches
  • In the data-plane, the only distinguishing feature of the packet is the dot1q header. This is why you can use non pvlan switches in the middle. Your transport switches are simply forwarding based on the MAC address in the vlans configured over the trunk links
  • SVIs can only be created in the primary vlan id. A Secondary SVI id will not come up
  • SVIs are not promiscuous by default. You need to map secondary vlans to an SVI port
  • If you need to monitor pvlan traffic on a non-pvlan switch, you’ll need to monitor both the primary and secondary vlan id.

802.1s – Multiple Spanning Tree – Regions

On December 5, 2012, in CCIE, by Darren

I’m not going into the basics of 802.1s as there is plenty of documentation showing that. The main point of this blog is to see how the actual regions work.

For this blog I’ll be using the following topology:
[Topology diagram: MST regions]

I’ve created vlan 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 on all devices. VTP is OFF. I have created int vlan 10 and int vlan 30 on each switch, addressed as 10.10.10.x and 30.30.30.x (x being the switch number). This will allow us to test connectivity.

I’ve got the following config on all these switches:

spanning-tree mode mst
!
spanning-tree mst configuration
 name mellowd
 revision 1
 instance 1 vlan 10, 30, 70
 instance 2 vlan 20, 40, 50

Any vlan not associated with an instance is automatically associated with instance 0. We can check this:

SW1#show span mst con
Name      [mellowd]
Revision  1     Instances configured 3

Instance  Vlans mapped
--------  ---------------------------------------------------------------------
0         1-9,11-19,21-29,31-39,41-49,51-69,71-4094
1         10,30,70
2         20,40,50
-------------------------------------------------------------------------------
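The instance 0 line in that output is simply "everything that is left over". As a small sketch of that rule (the helper names `instance_zero` and `vlan_ranges` are my own, not anything from IOS), you can take vlans 1-4094, remove whatever is mapped to another instance, and collapse the rest into ranges:

```python
def vlan_ranges(vlans):
    """Collapse vlan ids into an IOS-style range string, e.g. '1-9,11-19'."""
    ranges = []
    start = prev = None
    for vlan in sorted(vlans):
        if start is None:
            start = prev = vlan
        elif vlan == prev + 1:
            prev = vlan
        else:
            ranges.append((start, prev))
            start = prev = vlan
    if start is not None:
        ranges.append((start, prev))
    return ",".join(
        str(a) if a == b else "%d-%d" % (a, b) for a, b in ranges
    )


def instance_zero(mapping):
    """Vlans 1-4094 not mapped to any other instance belong to instance 0."""
    mapped = set()
    for vlans in mapping.values():
        mapped.update(vlans)
    return set(range(1, 4095)) - mapped


# Same mapping as the config above.
mapping = {1: {10, 30, 70}, 2: {20, 40, 50}}
print(vlan_ranges(instance_zero(mapping)))
# prints 1-9,11-19,21-29,31-39,41-49,51-69,71-4094
```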

MST considers switches to be in the same region as long as their vlan to instance mapping, name, and revision match. If any one of these is different, they are in different regions. As they all currently match, let’s have a look at the spanning tree:

SW2#sh span mst 0

##### MST0    vlans mapped:   1-9,11-19,21-29,31-39,41-49,51-69,71-4094
Bridge        address 001c.f903.d580  priority      32768 (32768 sysid 0)
Root          address 0012.daf2.c300  priority      32768 (32768 sysid 0)
              port    Fa0/23          path cost     0
Regional Root address 0012.daf2.c300  priority      32768 (32768 sysid 0)
                                      internal cost 200000    rem hops 19
Operational   hello time 2 , forward delay 15, max age 20, txholdcount 6
Configured    hello time 2 , forward delay 15, max age 20, max hops    20

Interface        Role Sts Cost      Prio.Nbr Type
---------------- ---- --- --------- -------- --------------------------------
Fa0/20           Altn BLK 200000    128.22   P2p
Fa0/23           Root FWD 200000    128.25   P2p

SW2 is showing SW1 as the root of MST 0. It also shows up as the regional root. I’ll expand on that a bit more later. The ports are both point-to-point. We can expand on the actual spanning-tree interface to see that:

SW2#sh span mst interface  fa0/23

FastEthernet0/23 of MST0 is root forwarding
Edge port: no             (default)        port guard : none        (default)
Link type: point-to-point (auto)           bpdu filter: disable     (default)
Boundary : internal                        bpdu guard : disable     (default)
Bpdus sent 9, received 399

Instance Role Sts Cost      Prio.Nbr Vlans mapped
-------- ---- --- --------- -------- -------------------------------
0        Root FWD 200000    128.25   1-9,11-19,21-29,31-39,41-49,51-69
                                     71-4094
1        Root FWD 200000    128.25   10,30,70
2        Root FWD 200000    128.25   20,40,50

You’ll notice the Boundary shows as internal.

Let’s take a look at the tree from SW4′s perspective for vlan 10:

SW4#sh span vlan 10

MST1
  Spanning tree enabled protocol mstp
  Root ID    Priority    32769
             Address     0012.daf2.c300
             Cost        200000
             Port        21 (FastEthernet0/21)
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec

  Bridge ID  Priority    32769  (priority 32768 sys-id-ext 1)
             Address     0017.0e23.d380
             Hello Time   2 sec  Max Age 20 sec  Forward Delay 15 sec

Interface           Role Sts Cost      Prio.Nbr Type
------------------- ---- --- --------- -------- --------------------------------
Fa0/20              Desg FWD 200000    128.20   P2p
Fa0/21              Root FWD 200000    128.21   P2p
Fa0/22              Altn BLK 200000    128.22   P2p
Fa0/24              Desg FWD 200000    128.24   P2p

Vlan 10 and vlan 30 are part of the same MST instance, so they share the same tree. If you manually prune certain vlans off certain links, this can spell disaster in an MST setup. Let’s check if SW4 has connectivity to SW1′s vlan 10 and vlan 30 interfaces:

SW4#ping 10.10.10.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms
SW4#ping 30.30.30.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 30.30.30.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms

fa0/21 is currently the root port. If I prune vlan 30 off that link, MST will NOT move vlan 30 over to the alternate port, because the tree is calculated per instance and not per vlan. In PVST+ it would, since the spanning-tree for vlan 30 would recalculate on its own:

interface FastEthernet0/21
 switchport trunk allowed vlan 1-29,31-4094

SW4#ping 30.30.30.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 30.30.30.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
SW4#ping 10.10.10.1

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/3/4 ms

vlan 30 traffic is now getting black-holed, while vlan 10 still works. I’ll remove the prune before moving on to the next test.
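To make the failure mode concrete, here is a minimal sketch of the check you would want to run before pruning. It assumes you know which port is forwarding for an instance and which vlans each trunk allows; the function name `blackholed_vlans` and the data layout are hypothetical.

```python
def blackholed_vlans(instance_vlans, forwarding_port, allowed_on_port):
    """Vlans in an MST instance that are pruned off the instance's forwarding port.

    The instance has a single tree, so a vlan that is not allowed on the
    forwarding port has no other path, even if an alternate port allows it.
    """
    allowed = allowed_on_port[forwarding_port]
    return sorted(v for v in instance_vlans if v not in allowed)


instance1 = {10, 30, 70}
allowed = {
    # switchport trunk allowed vlan 1-29,31-4094
    "Fa0/21": set(range(1, 4095)) - {30},
    # the alternate port still allows everything, but MST keeps it blocking
    "Fa0/22": set(range(1, 4095)),
}

print(blackholed_vlans(instance1, "Fa0/21", allowed))  # prints [30]
```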

Now let’s say we add another vlan mapping to SW2. We create vlan 110 and map it to instance 2. What happens?

SW2#sh span mst configuration
Name      [mellowd]
Revision  1     Instances configured 3

Instance  Vlans mapped
--------  ---------------------------------------------------------------------
0         1-9,11-19,21-29,31-39,41-49,51-69,71-109,111-4094
1         10,30,70
2         20,40,50,110
-------------------------------------------------------------------------------

If I now check the MST0:

SW2#sh span mst 0

##### MST0    vlans mapped:   1-9,11-19,21-29,31-39,41-49,51-69,71-109
                               111-4094
Bridge        address 001c.f903.d580  priority      32768 (32768 sysid 0)
Root          address 0012.daf2.c300  priority      32768 (32768 sysid 0)
              port    Fa0/23          path cost     200000
Regional Root this switch
Operational   hello time 2 , forward delay 15, max age 20, txholdcount 6
Configured    hello time 2 , forward delay 15, max age 20, max hops    20

Interface        Role Sts Cost      Prio.Nbr Type
---------------- ---- --- --------- -------- --------------------------------
Fa0/20           Altn BLK 200000    128.22   P2p Bound(RSTP)
Fa0/23           Root FWD 200000    128.25   P2p Bound(RSTP)

The ports have changed from P2p to P2p Bound(RSTP). Let’s take a look at the actual root port again:

SW2#sh span mst interface  fa0/23

FastEthernet0/23 of MST0 is root forwarding
Edge port: no             (default)        port guard : none        (default)
Link type: point-to-point (auto)           bpdu filter: disable     (default)
Boundary : boundary       (RSTP)           bpdu guard : disable     (default)
Bpdus sent 5, received 61

Instance Role Sts Cost      Prio.Nbr Vlans mapped
-------- ---- --- --------- -------- -------------------------------
0        Root FWD 200000    128.25   1-9,11-19,21-29,31-39,41-49,51-69
                                     71-109,111-4094
1        Mstr FWD 200000    128.25   10,30,70
2        Mstr FWD 200000    128.25   20,40,50,110

The Boundary field now shows boundary. These switches now consider themselves to be in different regions. All that has changed is that we have added another vlan to instance 2. The name and revision are still the same, but remember all 3 have to match. As this is a boundary, they actually run rapid spanning tree between them.
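The region test can be sketched as a three-way comparison. This is a simplified illustration only: real switches do not exchange the full table but compare a digest of it carried in the BPDU, and the names `vlan_table` and `same_region` are made up here.

```python
def vlan_table(mapping):
    """Expand {instance: [vlans]} into a per-vlan table; unmapped vlans are instance 0."""
    table = [0] * 4095  # index is the vlan id, index 0 is unused
    for instance, vlans in mapping.items():
        for vlan in vlans:
            table[vlan] = instance
    return table


def same_region(a, b):
    """Two switches share a region only if name, revision and mapping all match."""
    return (
        a["name"] == b["name"]
        and a["revision"] == b["revision"]
        and vlan_table(a["mapping"]) == vlan_table(b["mapping"])
    )


sw1 = {"name": "mellowd", "revision": 1,
       "mapping": {1: [10, 30, 70], 2: [20, 40, 50]}}
# SW2 after vlan 110 was added to instance 2
sw2 = {"name": "mellowd", "revision": 1,
       "mapping": {1: [10, 30, 70], 2: [20, 40, 50, 110]}}

print(same_region(sw1, sw1))  # prints True
print(same_region(sw1, sw2))  # prints False
```

Note that nothing in the comparison cares whether vlan 110 actually exists or is allowed on a trunk. Only the mapping counts.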

A single region will present itself as a single bridge with multiple links to another switch. This means you could have 100 switches in an MST region connected with multiple links to a single 802.1d-2004 switch. That 802.1d-2004 switch will assume that all these links go to a single bridge.

If you connect multiple MST regions together, each region will have its own regional root, but they will all see the best regional root as the actual root. You can check this on SW2:

SW2#sh span mst 0 detail

##### MST0    vlans mapped:   1-9,11-19,21-29,31-39,41-49,51-69,71-109
                               111-4094
Bridge        address 001c.f903.d580  priority      32768 (32768 sysid 0)
Root          address 0012.daf2.c300  priority      32768 (32768 sysid 0)
              port    Fa0/23          path cost     200000
Regional Root this switch

SW2 sees SW1 as the root bridge, but sees itself as the root of its own region. In order for multiple-region MST to work, the overall root bridge has to be in an MST region. If we make SW3 a non-MST bridge, and lower its priority to 0, it won’t work:

SW3
spanning-tree mode rapid-pvst
!
spanning-tree vlan 1-4094 priority 0

I immediately get this error on SW4:

%SPANTREE-2-PVSTSIM_FAIL: Blocking root port Fa0/24: Inconsitent inferior PVST BPDU received on VLAN 10, claiming root 10:0017.0e23.a800
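The "claiming root 10:..." part of that message comes from the extended system ID: the priority carried in a per-vlan BPDU is the configured priority plus the vlan id, which is also why SW4 showed 32769 for MST1 earlier. The sketch below shows the arithmetic. The second function reflects my understanding of the PVST simulation check (on a boundary root port, the BPDUs for other vlans must not be worse than the vlan 1 BPDU), so treat that rule and both function names as assumptions rather than a statement of exactly how IOS implements it.

```python
def effective_priority(configured, vlan):
    """Priority carried in a per-vlan BPDU: configured value plus the vlan id."""
    return configured + vlan


def inferior_vlans(configured, vlans):
    """Vlans whose BPDU priority is worse (numerically higher) than vlan 1's.

    Assumes the same bridge MAC and the same configured priority on every vlan,
    as with 'spanning-tree vlan 1-4094 priority 0'.
    """
    reference = effective_priority(configured, 1)
    return [v for v in vlans if effective_priority(configured, v) > reference]


print(effective_priority(0, 10))       # prints 10
print(inferior_vlans(0, [1, 10, 30]))  # prints [10, 30]
```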

If you check the spanning-tree now:

SW4#sh span mst 0 | include Fa0/24
              port    Fa0/24          path cost     200000
Fa0/24           Root BKN*200000    128.24   P2p Bound(PVST) *PVST_Inc

SW4 has blocked this port. This means no traffic can get to SW3:

SW4#ping 10.10.10.3

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.3, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

If I remove the priority on SW3, all goes back to normal:

%SPANTREE-2-PVSTSIM_OK: PVST Simulation inconsistency cleared on port FastEthernet0/24.

SW4#sh span mst 0 | include Fa0/24
Fa0/24           Desg FWD 200000    128.24   P2p Bound(PVST) *PVST_Inc
SW4#ping 10.10.10.3

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.10.10.3, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/4/4 ms

Conclusions:

  • Vlan to instance mappings, revision, and region name need to match in order for switches to be in the same region
  • Vlans do not actually need to be created, or even allowed over trunks, in order to be mapped to an instance. The essential part is the vlan id to instance mapping
  • If any one of the above doesn’t match, switches are in different regions and will run RSTP between them
  • Manually pruning vlans can lead to black-holing of traffic
  • If running multiple regions with legacy switches, always ensure one of the MST switches is actually the root (just use priority 0)

© 2009-2013 Darren O'Connor All Rights Reserved