DHCP Snooping – Filter those broadcasts!

I had a specific requirement recently and wanted to test the behaviour of a particular feature: DHCP snooping. Let’s quickly go over the DHCP process at a high level to see how it works:

DHCP

Let’s take the following simple diagram to show what’s going on. We have a switch with two hosts connected. We also have a DHCP server. I’m using generic names as I’ll be testing this on different switches. Assume all devices are in the same VLAN.
dhcp_snoop1

Host1 has just booted and needs an IP address. It’ll send a DHCP DISCOVER packet which is a broadcast. This broadcast gets sent to all ports in the vlan:
dhcp_snoop2

The DHCP server will then send a DHCP OFFER to Host1. It does this via unicast, using the client’s MAC address as the layer 2 destination:
dhcp_snoop3

Host1 then sends a DHCP REQUEST via broadcast. Why broadcast? It may have received offers from multiple DHCP servers, and the broadcast tells all of them which single offer it is accepting.
dhcp_snoop4

Finally, the DHCP server acknowledges that Host1 has accepted its offered IP with a DHCP ACK. This is unicast again:
dhcp_snoop5

Now, depending on BOOTP flags, the OFFER and/or ACK might actually be broadcast. The behaviour is also slightly different when using DHCP helpers, but we are mainly concerned with the DHCPDISCOVER and DHCPREQUEST packets, which are always broadcast.
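For reference, the message types above are distinguished by DHCP option 53 in the packet's options field. Here's a minimal Python sketch (my own illustration, not taken from any capture in this post) that walks the TLV-encoded options and pulls out the message type:

```python
# DHCP message type values per RFC 2132, option 53:
# 1=DISCOVER, 2=OFFER, 3=REQUEST, 5=ACK.
DHCP_TYPES = {1: "DHCPDISCOVER", 2: "DHCPOFFER", 3: "DHCPREQUEST", 5: "DHCPACK"}

def dhcp_message_type(options: bytes) -> str:
    """Walk the TLV-encoded DHCP options and return the message type name."""
    i = 0
    while i < len(options):
        opt = options[i]
        if opt == 0:          # pad option: single byte, no length field
            i += 1
            continue
        if opt == 255:        # end option: stop parsing
            break
        length = options[i + 1]
        if opt == 53:         # DHCP message type option
            return DHCP_TYPES.get(options[i + 2], "unknown")
        i += 2 + length       # skip over this option's type, length, value
    return "unknown"

# Option 53 (length 1, value 1) followed by the end option: a DISCOVER.
print(dhcp_message_type(bytes([53, 1, 1, 255])))   # DHCPDISCOVER
```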

DHCP Snooping

In the above example, there was nothing stopping Host2 from handing out IP addresses via DHCP. This could be malicious activity, or simply a mistake: a misconfigured device, or a device plugged in where it shouldn’t be.

DHCP snooping was created to prevent this from happening. Its main concern is making sure that DHCPOFFERs only come in via trusted ports. In our example, port 1, connected to the DHCP server, should be a trusted port. Ports 2 and 3, connected to Host1 and Host2 respectively, should never see DHCPOFFER packets on ingress. But here is the kicker: a DHCPOFFER is a response to an event. That event is a DHCPDISCOVER, and that DHCPDISCOVER is a broadcast.

It stands to reason that if a DHCPOFFER can never legitimately ingress ports 2 and 3, those ports should never have DHCPDISCOVER packets replicated to them in the first place, even though those packets are broadcast. All other broadcasts should go through, but these specific DHCP ones should not.
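The filtering decision described above can be sketched in a few lines of Python (my own model, not vendor code): server-originated messages are dropped on ingress at untrusted ports, and the stricter behaviour additionally avoids replicating client broadcasts out untrusted ports.

```python
SERVER_MESSAGES = {"DHCPOFFER", "DHCPACK"}
CLIENT_MESSAGES = {"DHCPDISCOVER", "DHCPREQUEST"}

def allow_ingress(msg_type: str, port_trusted: bool) -> bool:
    """Server-originated DHCP messages may only enter via trusted ports."""
    if msg_type in SERVER_MESSAGES and not port_trusted:
        return False
    return True

def egress_ports(msg_type: str, ports: dict) -> list:
    """For a client DHCP broadcast, the strict behaviour floods only to
    trusted ports; all other broadcasts flood everywhere.
    ports maps port name -> trusted flag."""
    if msg_type in CLIENT_MESSAGES:
        return [p for p, trusted in ports.items() if trusted]
    return list(ports)

ports = {"fa0/1": False, "fa0/2": False, "fa0/24": True}
print(allow_ingress("DHCPOFFER", port_trusted=False))   # False: rogue server blocked
print(egress_ports("DHCPDISCOVER", ports))              # ['fa0/24']
```

An ARP broadcast, by contrast, would still flood to all three ports.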

So is this what we actually see in the real world? I’ll test this on the devices I have available to see what behaviour I see.

Cisco Catalyst IOS

My config is as follows:

ip dhcp snooping vlan 1-4094
ip dhcp snooping

interface FastEthernet0/1
 switchport access vlan 10
 switchport mode access
!
interface FastEthernet0/2
 switchport access vlan 10
 switchport mode access
!
interface FastEthernet0/24
 switchport access vlan 10
 switchport mode access
 ip dhcp snooping trust

DHCP snooping is enabled, with fa0/24 as the trusted port going towards my server.

I have host1 and host2 connected with the following MAC addresses:

  • 78:2b:cb:e4:e3:88
  • 00:26:5a:ef:85:33

I’ll now listen on fa0/24. I should see both DHCPDISCOVER broadcasts coming through:

$ sudo tcpdump -i eth1 -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
11:45:45.204815 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 78:2b:cb:e4:e3:88, length 300
11:45:48.733826 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:26:5a:ef:85:33, length 300

That’s exactly what I see.

If I now move the capture point over to fa0/2, I hope to see no broadcasts at all. Silence here would mean the device is not replicating those broadcasts out untrusted ports:

$ sudo tcpdump -i eth1 -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes

Silence. That’s just what I wanted to see.

Juniper EX

Config is as follows:

root@ex2200-12> show configuration interfaces ge-0/0/2           
unit 0 {
    family ethernet-switching {
        port-mode access;
        vlan {
            members vlan_test;
        }
    }
}

{master:0}
root@ex2200-12> show configuration interfaces ge-0/0/3           
unit 0 {
    family ethernet-switching {
        port-mode access;
        vlan {
            members vlan_test;
        }
    }
}

{master:0}
root@ex2200-12> show configuration interfaces ge-0/0/4           
unit 0 {
    family ethernet-switching {
        port-mode access;
        vlan {
            members vlan_test;
        }
    }
}


root@ex2200-12> show configuration ethernet-switching-options    
secure-access-port {
    interface ge-0/0/4.0 {
        dhcp-trusted;
    }
    vlan all {
        examine-dhcp;
    }
}

ge-0/0/4 is now my trusted DHCP server port. If I listen on that port, I should see both devices’ broadcasts:

$ sudo tcpdump -i eth1 -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes

11:58:02.539119 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 78:2b:cb:e4:e3:88, length 300
11:58:05.809947 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:26:5a:ef:85:33, length 300

What about when listening on the untrusted port?

$ sudo tcpdump -i eth1 -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
11:58:55.342651 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 78:2b:cb:e4:e3:88, length 300

I hear the broadcast come through. There is only one MAC as I’ve had to disconnect one host in order to listen via Wireshark.

Conclusions

  • IOS switches filter the initial DHCPDISCOVER broadcast packets. Junos switches do not.
  • Both devices DO drop DHCPOFFER packets coming in on untrusted ports.
  • Cisco is a bit more intelligent in its behaviour.

Not filtering the initial broadcast doesn’t break DHCP snooping, but it’s completely unnecessary. Why send a request out of a port on which you would filter the reply? I’ve seen switches reload and suddenly all devices on the switch try to get their IPs back. Every device receives all of these broadcasts when only the trusted port should. Filtering means fewer broadcasts on the network, and it also prevents badly configured devices from replying to packets they should never have received.

Fundamentals – PMTUD – IPv4 & IPv6 – Part 1 of 2

One of IPv6’s features is that routers are no longer supposed to fragment packets. Rather, it’s up to the hosts on either end to work out the path MTU. This differs from IPv4, in which routers along the path could fragment packets. Both IPv4 and IPv6 have a mechanism to work out the path MTU, which is what I’ll go over in this post. Instead of covering each separately, I’ll show the problem being solved and how the two differ when it comes to sending traffic.

I’ll be using the following topology in this post:
pmtu-1

The problem

When you visit this blog, your browser requests a particular web page from my server. This request is usually quite small. My server needs to respond with some actual data: the images, words, plugins, style-sheets, etc. This data can be quite large. My server needs to break this stream of data into IP packets to send back to you. Each packet requires a few headers, so the most efficient way to send data back to you is the largest amount of data in the fewest packets.

Between you and my server sits a load of different networks and hardware. There is no way for my server to know the maximum MTU supported by all those devices along the path. Not only can this path change, but I have many thousands of readers in thousands of different countries. In the topology above, the link between R2 and R4 has an MTU of 1400. None of the hosts are directly connected to that segment and so none of them know the MTU of the entire path.
pmtu-2

PMTUD

Path MTU Discovery, RFC1191 for IPv4 and RFC1981 for IPv6, does exactly what the name suggests: it discovers the MTU of the path. There are a number of similarities between the two RFCs, but also a few key differences which I’ll dig into.

Note – OS implementations of PMTUD can vary widely. I’ll be showing both Debian Linux server 7.6.0 and Windows Server 2012 in this post.

Both RFCs state that hosts should initially assume that the MTU across the entire path matches the first-hop MTU. i.e. the servers should assume that the path MTU matches the MTU of the link they are connected to. In this case both my Windows and Linux servers have a local MTU of 1500.
pmtu-3

The link between R2 and R4 has an IP MTU of 1400. My servers need to figure out the path MTU in order to maximise the packet size without fragmentation.

IPv4

RFC1191 states:

    The basic idea is that a source host initially assumes that the PMTU of a path is the (known) MTU of its first hop, and sends all datagrams on that path with the DF bit set. If any of the datagrams are too large to be forwarded without fragmentation by some router along the path, that router will discard them and return ICMP Destination Unreachable messages with a code meaning “fragmentation needed and DF set” [7]. Upon receipt of such a message (henceforth called a “Datagram Too Big” message), the source host reduces its assumed PMTU for the path.

    In my example, the servers should assume that the path MTU is 1500. They should send packets back to the user using this MTU with the Do Not Fragment bit set. R2’s link to R4 is not big enough, so R2 should drop the packet and return the correct ICMP message back to my servers. Those servers should then resend the packets with a lower MTU.
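The algorithm both RFCs describe can be modelled as a toy simulation (my own illustration, with made-up link MTUs): the sender starts at the first-hop MTU, and each hop that cannot forward the packet "returns" its next-hop MTU, which the sender adopts before retrying.

```python
def discover_pmtu(link_mtus: list) -> int:
    """Simulate PMTUD over a path. link_mtus lists the MTU of each
    link along the path, in order from the sender outwards."""
    pmtu = link_mtus[0]                     # assume the first-hop MTU initially
    while True:
        # Find the first link too small for the current packet size.
        bottleneck = next((m for m in link_mtus if m < pmtu), None)
        if bottleneck is None:
            return pmtu                     # packet made it end to end
        pmtu = bottleneck                   # the ICMP message told us the next-hop MTU

# 1500 on the first hop, 1400 in the middle (like the R2-R4 link):
print(discover_pmtu([1500, 1500, 1400, 1500]))   # 1400
```

Note that if the ICMP message never makes it back, the loop above would simply retry at the old size forever, which is exactly the blackhole problem discussed later.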

    I’m going to show Wireshark capture from the servers point of view. I’ll start with Windows.

    The first part is the regular TCP 3-way handshake to set up the session. These packets are very small so are generally not fragmented:
    Screen Shot 2014-08-25 at 12.37.40
    The user then requests a file. The server responds with full size packets with the DF bit set. Those packets are dropped by R2, who sends back the required ICMP message:
    Screen Shot 2014-08-25 at 12.39.49

    Dig a bit deeper into those packets. First the full size packet from the server. Note the DF-bit has been set:
    Screen Shot 2014-08-25 at 12.43.49

    Second, the ICMP message sent from R2. This is an ICMP Type 3 Code 4 message. It states the destination is unreachable and that fragmentation is required. Note it also states the MTU of the next hop. The Windows server can use this value to re-originate its packets with a lower MTU.

    All the rest of the packets in the capture then have the lower MTU set. Note that Wireshark shows the frame size including the 14-byte Ethernet header, hence the value of 1414:
    Screen Shot 2014-08-25 at 12.49.11

    RFC1191 states that a server should cache a lowered MTU value. It’s also suggested that this value be cached for 10 minutes, and that the timeout should be tweakable. You can view the cached value on Windows, but it doesn’t show the timer. Perhaps a reader could let me know?
    Screen Shot 2014-08-25 at 12.53.53

    I’ll now do the same on my Debian server. First part is the 3-way handshake again:
    Screen Shot 2014-08-26 at 1.27.02 pm
    The server starts sending packets with an MTU of 1500:
    Screen Shot 2014-08-26 at 1.28.48 pm
    Which are dropped by R2, with ICMP messages sent back:
    Screen Shot 2014-08-26 at 1.29.52 pm
    The Debian server will cache that entry. Debian does show me the remaining cache time, in this case 584 seconds:
    Screen Shot 2014-08-26 at 1.32.23 pm

IPv6

RFC1981 goes over the finer details of how this works with IPv6. The majority of the document is identical to the RFC1191 version.

    When the Debian server responds, the packets have a size of 1514 on the wire as expected. Note, however, that there is no DF bit in IPv6 packets. This is a major difference between IPv4 and IPv6. Routers CANNOT fragment IPv6 packets, so there is no reason to explicitly state this in the packet. All IPv6 packets are non-fragmentable by routers in the path. I’ll go over what this means in depth later.
    Screen Shot 2014-08-27 at 8.06.39 am
    R2 cannot forward this packet and drops it. The message returned by R2 is still an ICMP message, but it’s a bit different to the IPv4 version:
    Screen Shot 2014-08-27 at 8.10.56 am

    This time the message is ‘Packet too big’ – Very easy to figure out what that means. The ICMP message will contain the MTU of the next-hop as expected:
    Screen Shot 2014-08-27 at 8.14.02 am

    The server will act on this message, cache the result, then send packets up to the required MTU:
    Screen Shot 2014-08-27 at 8.17.30 am
    Screen Shot 2014-08-27 at 8.18.29 am

    Windows Server 2012 has identical behaviour. To show the cache, simply view the IPv6 destination cache (netsh interface ipv6 show destinationcache) and you’re good to go.

Problems

    So what could possibly go wrong? The above all looks good and works in the lab. The biggest issue is that both mechanisms require those ICMP messages to make it back to the sending host. There are a load of badly configured firewalls and ACLs out there dropping more ICMP than they should. Some people even drop ALL ICMP. There is another issue that I’ll go over in another blog post in the near future.

    In the above examples, if those ICMP messages don’t get back, the sending host will not adjust its MTU. If it continues to send large packets, the router with the smaller MTU will drop them. All that traffic is blackholed. Smaller packets, like requests, will get through. Ping will even get through if echo-requests and echo-replies have been let through. You might even see the beginnings of a web page, but the big content will not load.

    On R1’s fa1/1 interface I’ll apply this bad access list:

    R1#sh ip access-lists
    Extended IP access list BLOCK-ICMP
        10 permit icmp any any echo
        20 permit icmp any any echo-reply
        30 deny icmp any any
        40 permit ip any any

    From the client I can ping the host:
    Screen Shot 2014-08-27 at 8.31.41 am
    I can even open a text-based page from the server:
    Screen Shot 2014-08-27 at 8.32.30 am

    But try to download the file:
    Screen Shot 2014-08-27 at 8.33.39 am
    The initial 3-way handshake works fine, but nothing else happens. The Debian server is sending those packets, R2 is dropping them and informing the sender, but R1 drops those ICMP messages. You’ve now got a black hole. The same thing happens with IPv6, though of course the packet dropped is the Packet Too Big message.

Workarounds

    The best thing to do is fix the problem. Unfortunately that’s not always possible. There are a few things that can be done to work through the problem of dropped ICMP packets.
    If you know the MTU value further down the line, you can use TCP MSS clamping. This causes the router to intercept TCP SYN packets and rewrite the TCP MSS option. You need to take into account the size of the IP and TCP headers.

    R1#conf t
    Enter configuration commands, one per line.  End with CNTL/Z.
    R1(config)#int fa1/1
    R1(config-if)#ip tcp adjust-mss  1360
    R1(config-if)#end
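Where does 1360 come from? The MSS is the path MTU minus the IP and TCP headers, 20 bytes each without options. A quick sketch of the arithmetic:

```python
def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS = MTU minus the IP and TCP headers (no options assumed)."""
    return mtu - ip_header - tcp_header

print(mss_for_mtu(1400))                 # 1360, matching the clamp above
# The IPv6 base header is 40 bytes, so the same link needs a lower MSS:
print(mss_for_mtu(1400, ip_header=40))   # 1340
```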

    Note how the MSS value has been changed to 1360:
    Screen Shot 2014-08-28 at 1.46.58 pm

    I’ve tested with IOS 15.2(4)S2 and it also works with IPv6:
    Screen Shot 2014-08-28 at 1.54.57 pm

    The problem with this is that it’s a burden on the router it’s configured on. Your router might not even support this option, and it affects ALL TCP traffic going through that router. MSS clamping can work well for VPN tunnels, but it’s not a very scalable solution.

    Another workaround is to have the router clear the DF bit and let the routers along the path fragment the packets:

    route-map CLEAR-DF permit 10
     set ip df 0
    !
    interface FastEthernet1/1
     ip address 192.168.4.1 255.255.255.0
     ip router isis
     ip policy route-map CLEAR-DF
     ipv6 address 2001:DB8:10:14::1/64
     ipv6 router isis

    The problem with this is that you’re placing a burden on the router again. It’s also not at all efficient. Some firewalls block fragments, and some routers might simply drop fragmented packets.
    The biggest problem is that there is no DF bit to clear in IPv6. IPv6 packets will not be fragmented by routers; fragmentation has to be done by the host.

End of Part One

    There is simply too much to cover in a single post. I’ll end this post here. Part two will be coming soon!

The dangers of ignoring OSPF MTU

    Quite often I see ip ospf mtu-ignore configured when two routers’ MTUs don’t match. This is bad. To demonstrate why, I’ll use the following simple topology:

    Let’s create a simple area 0 point-to-point adjacency between the two routers and make R1’s MTU slightly larger. Then we’ll ignore the OSPF MTU, as otherwise the adjacency will not come up:

    R1
    interface GigabitEthernet1/0
     mtu 2000
     ip address 10.0.0.1 255.255.255.0
     ip ospf network point-to-point
     ip ospf mtu-ignore
     ip ospf 1 area 0
    

    The adjacency is fine as far as we can see:

    R2#sh ip ospf neighbor | beg Nei
    Neighbor ID     Pri   State           Dead Time   Address         Interface
    1.2.3.255         0   FULL/  -        00:00:30    10.0.0.1        GigabitEthernet1/0

    Now I’ve added 255 loopback interfaces onto R1 and put them all into OSPF by using network 0.0.0.0 0.0.0.0 area 0. This means all those loopback interfaces will be part of the type 1 LSA originated by R1. What happens though?

    interface Loopback1
     ip address 1.2.3.1 255.255.255.255
    !
    interface Loopback2
     ip address 1.2.3.2 255.255.255.255
    !
    interface Loopback3
    !
    [etc etc etc]
    !
    router ospf 1
     network 0.0.0.0 0.0.0.0 area 0

    At first, nothing seems wrong. But take a look at the database from R1’s and R2’s perspectives. Remember, the databases should be identical.

    R1#sh ip ospf database
    
                OSPF Router with ID (1.2.3.255) (Process ID 1)
    
                    Router Link States (Area 0)
    
    Link ID         ADV Router      Age         Seq#       Checksum Link count
    1.2.3.255       1.2.3.255       33          0x80000005 0x00767B 257
    10.0.0.2        10.0.0.2        100         0x80000011 0x00C816 2
    R2#sh ip ospf database
    
                OSPF Router with ID (10.0.0.2) (Process ID 1)
    
                    Router Link States (Area 0)
    
    Link ID         ADV Router      Age         Seq#       Checksum Link count
    1.2.3.255       1.2.3.255       130         0x80000004 0x00856D 2
    10.0.0.2        10.0.0.2        128         0x80000011 0x00C816 2

    R1 sees a link count of 257 for R1’s router LSA, while R2 only sees 2. This can be confirmed by seeing that R2 doesn’t have any OSPF routes to R1’s loopbacks:

    R2#sh ip route ospf | beg Gate
    Gateway of last resort is not set
    
    

    If you wait a while you’ll see LOADING on the adjacency too, and eventually the adjacency resets and tries again:

    R2#sh ip ospf neighbor
    
    Neighbor ID     Pri   State           Dead Time   Address         Interface
    1.2.3.255         0   LOADING/  -     00:00:32    10.0.0.1        GigabitEthernet1/0
    R2#
    *Feb  6 19:11:26.958: %OSPF-5-ADJCHG: Process 1, Nbr 1.2.3.255 on GigabitEthernet1/0 
    from LOADING to DOWN, Neighbor Down: Too many retransmissions

    So what exactly is happening? If you check Wireshark you’ll see the issue straight away:


    OSPF does not do any sort of path MTU discovery. R1 is attempting to send a type 1 LSA update using its interface MTU of 2000. R2 cannot receive that large a frame, so it gets dropped. R2 never acknowledges the LSA as it never receives it, and eventually that causes the adjacency to reset. This then repeats over and over.
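A rough estimate (my own arithmetic, ignoring optional TOS entries) shows why the router LSA outgrew the smaller MTU: each link entry in a type 1 LSA is 12 bytes, on top of a 20-byte LSA header and 4 bytes of flags plus link count.

```python
def router_lsa_size(link_count: int) -> int:
    """Approximate size of a type 1 (router) LSA in bytes:
    20-byte LSA header + 4 bytes (flags, link count) + 12 bytes per link."""
    return 20 + 4 + 12 * link_count

print(router_lsa_size(2))     # 48 bytes - R2's tiny LSA
print(router_lsa_size(257))   # 3108 bytes - R1's LSA, far beyond a 1500-byte MTU
```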

    This could be hidden though. Let’s stop R1 advertising all those addresses via its type 1 LSA and instead redistribute the connected links into OSPF:

    R1(config)#router ospf 1
    R1(config-router)#no network 0.0.0.0 0.0.0.0 area 0
    R1(config-router)#redistribute connected subnets
    R1(config-router)#end
    
    R2#sh ip ospf neighbor
    
    Neighbor ID     Pri   State           Dead Time   Address         Interface
    1.2.3.255         0   FULL/  -        00:00:38    10.0.0.1        GigabitEthernet1/0
    R2#sh ip route ospf | beg Gate
    Gateway of last resort is not set
    
          1.0.0.0/32 is subnetted, 255 subnets
    O E2     1.2.3.1 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.2 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.3 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.4 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.5 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.6 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.7 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.8 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    O E2     1.2.3.9 [110/20] via 10.0.0.1, 00:01:14, GigabitEthernet1/0
    [etc]

    This time it works, even with a mismatched MTU. Why? A type 5 LSA only carries a single prefix. This means that R1 originates 255 type 5 LSAs, and each of those LSAs is much, much smaller than 2000 bytes. The LSA updates therefore stay under 1500 bytes, so R2 never drops any of those packets.

    A router only originates a single router LSA, and that single LSA has to contain all the links on that router that are enabled for OSPF in the area. If a router has 1000 interfaces, that’s a large type 1 LSA.

    You can see the individual type5s in the database itself:

    R2#sh ip ospf database | beg External
                    Type-5 AS External Link States
    
    Link ID         ADV Router      Age         Seq#       Checksum Tag
    1.2.3.1         1.2.3.255       390         0x80000001 0x00692A 0
    1.2.3.2         1.2.3.255       390         0x80000001 0x005F33 0
    1.2.3.3         1.2.3.255       390         0x80000001 0x00553C 0
    1.2.3.4         1.2.3.255       390         0x80000001 0x004B45 0
    1.2.3.5         1.2.3.255       390         0x80000001 0x00414E 0
    1.2.3.6         1.2.3.255       390         0x80000001 0x003757 0
    1.2.3.7         1.2.3.255       390         0x80000001 0x002D60 0
    1.2.3.8         1.2.3.255       390         0x80000001 0x002369 0
    1.2.3.9         1.2.3.255       390         0x80000001 0x001972 0
    1.2.3.10        1.2.3.255       390         0x80000001 0x000F7B 0
    [etc etc]

    Out of interest, type 3, type 4, type 5, and type 7 LSAs all follow the ‘single prefix per LSA’ model and as such should never be that big. A type 2 LSA will expand to reflect the number of routers on the layer 2 segment, but I find it hard to believe that there would be over 100 routers on a single segment (though it’s not impossible).

    By the way, I wrote a separate post explaining a few more in-depth spf considerations when it comes to type1s and type5s over here: OSPF – Type 1 LSA vs Type 5 LSA (passive vs redistribute)

    So there you have it. Ignore the MTU at your own peril. Fix the MTU issue rather than just ignoring it. It might not be a problem ‘now’, but as your router LSA grows in size you will suddenly run into one.

Next-Hop IP. What does it actually mean?

    I’ve seen so much confusion about the fundamentals of IP routing that I thought it would be good to write something like this.


    If packets are getting sent to a default gateway, or next-hop, whatever – is that packet actually addressed to that next-hop? Well, it depends on what layer we are talking about. From a layer 3 perspective it’s never actually addressed to that next-hop. i.e. the source and destination IP addresses NEVER change, unless you have some sort of device doing NAT.


    The next-hop address is merely an address that you hope the packet travels towards. If the next-hop is on the same subnet as the source host, then an ARP resolution will take place and the packet will be sent to the gateway’s MAC address. The destination IP has not changed at all.


    If the next-hop is NOT on the same subnet, the packet will travel to the local gateway and then onwards. That gateway might have another idea of where the packet should go as, again, the packet is not actually addressed to that next-hop at layer 3.

    This also means that a next-hop address could even be an address that doesn’t exist. As long as the packet travels in the right direction you are good to go.


    Let’s take the following diagram as an example:

    R2 and R3 are running OSPF with each other. R3 has a loopback of 3.3.3.3 advertised into OSPF so R2 knows how to get there. R1 and R2 are not running OSPF. R2 is advertising the R1 and R2 link into OSPF as a stub network.

    The actual subnets used are 10.12.12.0/24 and 10.23.23.0/24
    R1:

    interface FastEthernet0/0
     ip address 10.12.12.1 255.255.255.0
    

    R2:

    interface FastEthernet0/0
     ip address 10.12.12.2 255.255.255.0
     ip ospf 1 area 0
    !
    interface FastEthernet0/1
     ip address 10.23.23.2 255.255.255.0
     ip ospf network point-to-point
     ip ospf 1 area 0
    !
    router ospf 1
     passive-interface FastEthernet0/0

    R3:

    interface Loopback0
     ip address 3.3.3.3 255.255.255.255
     ip ospf 1 area 0
    !
    interface FastEthernet0/0
     ip address 10.23.23.3 255.255.255.0
     ip ospf network point-to-point
     ip ospf 1 area 0

    On R1 I now set a static route to 3.3.3.3/32 with a next-hop of 192.168.1.1, which does not exist anywhere. I then create another route to 192.168.1.1/32 with a next-hop of 10.12.12.2:

    ip route 3.3.3.3 255.255.255.255 192.168.1.1
    ip route 192.168.1.1 255.255.255.255 10.12.12.2
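The recursion those two routes set up can be sketched as a chain of table lookups (illustrative only; real routers use longest-prefix matching and CEF, whereas this sketch uses exact-match keys to keep it short):

```python
def resolve(dest: str, rib: dict) -> str:
    """Keep resolving the next-hop until it maps to something
    directly connected."""
    hop = rib[dest]
    while not hop.startswith("connected:"):
        hop = rib[hop]
    return hop

rib = {
    "3.3.3.3": "192.168.1.1",                   # static route via a non-existent hop
    "192.168.1.1": "10.12.12.2",                # which recurses to R2's address
    "10.12.12.2": "connected:FastEthernet0/0",  # reachable on the connected subnet
}
print(resolve("3.3.3.3", rib))   # connected:FastEthernet0/0
```

The packet for 3.3.3.3 ends up framed towards R2 on Fa0/0 even though 192.168.1.1 exists nowhere, which is exactly what the CEF output below the route table confirms.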

    Let’s have a look at the route table on R1:

    R1#sh ip route 3.3.3.3
    Routing entry for 3.3.3.3/32
      Known via "static", distance 1, metric 0
      Routing Descriptor Blocks:
      * 192.168.1.1
          Route metric is 0, traffic share count is 1
    
    R1#sh ip route 192.168.1.1
    Routing entry for 192.168.1.1/32
      Known via "static", distance 1, metric 0
      Routing Descriptor Blocks:
      * 10.12.12.2
          Route metric is 0, traffic share count is 1

    As expected everything works fine:

    R1#ping 3.3.3.3
    
    Type escape sequence to abort.
    Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
    !!!!!
    Success rate is 100 percent (5/5), round-trip min/avg/max = 13/23/29 ms

    However the above example is probably not the best example as CEF would already have worked out all the recursive routing needed:

    R1#sh ip cef 3.3.3.3
    3.3.3.3/32, version 7, epoch 0, cached adjacency 10.12.12.2
    0 packets, 0 bytes
      via 192.168.1.1, 0 dependencies, recursive
        next hop 10.12.12.2, FastEthernet0/0 via 192.168.1.1/32
        valid cached adjacency

    But it does prove that the packet is able to get to 3.3.3.3 even with a next-hop that does not actually exist anywhere.

    Let’s now make a more complicated scenario:

    My subnet addressing is similar to before. This time R5 is advertising its loopback interface into OSPF. R1 is NOT running OSPF.

    R1 has a static route that says to get to 5.5.5.5 it needs to send it to R3. It then has a route to R3 via R2.

    R1#sh ip route 5.5.5.5
    Routing entry for 5.5.5.5/32
      Known via "static", distance 1, metric 0
      Routing Descriptor Blocks:
      * 10.23.23.3
          Route metric is 0, traffic share count is 1
    

    But what happens when I traceroute from R1?

    R1#traceroute 5.5.5.5
    Type escape sequence to abort.
    Tracing the route to 5.5.5.5
      1 10.12.12.2 52 msec 76 msec 4 msec
      2 10.24.24.4 80 msec 72 msec 68 msec
      3 10.45.45.5 188 msec *  84 msec

    The traffic gets to my destination, but it did not ever get near R3. Why is that?
    Have a look at R2:

    R2#sh ip route 5.5.5.5
    Routing entry for 5.5.5.5/32
      Known via "static", distance 1, metric 0
      Routing Descriptor Blocks:
      * 10.24.24.4
          Route metric is 0, traffic share count is 1

    I put a static route on R2 to send traffic for 5.5.5.5 via R4, not R3.

    So all in all, really simple. What I’m trying to show is that in regular routing, each and every hop along the way makes its own independent decision on how to reach the destination. When the packet gets to R2, it has no idea that R1 wanted to go via R3, because that next-hop is not encoded anywhere in the packet. All R1 is doing is sending traffic ‘towards’ the next-hop. R2 makes its own decision as it only sees the destination address of 5.5.5.5.

    This behaviour explains routing loops, as well as the problem of traffic getting dropped inside an AS running BGP.

Restricting users to only view parts of the SNMP tree – Cisco

    It’s well known that you can give your customer read-only access to the SNMP tree, but are you sure you want to give them that much information? Even though they can’t change anything, they are able to extract the full configuration, the full routing table, and much, much more.

    As a test I set up SNMP read-only access to a Cisco box I have and ran a full snmpwalk on it. I extracted over 8MB worth of text data, including full routing tables, ARP tables, OSPF tables, etc.

    Not only that, but while I was running the walk my device CPU was sitting pretty high:

    Router#sh proc cpu sorted
    CPU utilization for five seconds: 33%/3%; one minute: 76%; five minutes: 54%
    PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
    210      121148      106996       1132 15.11% 44.65% 25.56%   0 SNMP ENGINE
    107       70240      213991        328  7.35% 12.99% 11.51%   0 IP SNMP

    Walking the entire SNMP tree also took almost 5 minutes.

    So do you really want your customer to know that much? And do you really want your customer’s monitoring system polling your devices for everything while your device sits with high CPU all the time?

    I was testing with a few views this morning and came up with the following:

    snmp-server view RESTRICT iso included
    snmp-server view RESTRICT at.* excluded
    snmp-server view RESTRICT ip.* excluded
    snmp-server view RESTRICT ospf.* excluded
    snmp-server community [community] view RESTRICT RO [acl]

    When I polled using this community it took less than 5 seconds and gave me pretty much all the information I would want to give the customer. Be sure to exclude the routing protocol you’re actually using; I have excluded OSPF above.
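The way a view like RESTRICT is evaluated can be modelled simply (my own simplified sketch: the most specific matching subtree entry decides whether an OID is visible; the names below stand in for numeric OID prefixes):

```python
def visible(oid: str, view: list) -> bool:
    """view is a list of (subtree_prefix, included) entries.
    The longest matching prefix wins; no match means not visible."""
    matches = [(prefix, inc) for prefix, inc in view if oid.startswith(prefix)]
    if not matches:
        return False
    _, included = max(matches, key=lambda m: len(m[0]))
    return included

RESTRICT = [
    ("iso", True),         # iso included
    ("iso.ip", False),     # ip.* excluded
    ("iso.ospf", False),   # ospf.* excluded
]
print(visible("iso.system.sysDescr", RESTRICT))   # True
print(visible("iso.ip.ipRouteTable", RESTRICT))   # False
```

sysDescr falls only under the included iso subtree, while the route table falls under the more specific excluded ip subtree, so the walk skips it.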

    Out of interest, an snmpwalk on my edge BGP router gives me a text file of 0.5GB!