Answer: the one where their switch eats the DHCPDISCOVER packets. And the amazing thing is they never learn. And the second amazing thing is that they don’t apply even the most basic network debugging techniques when such a problem occurs. I’m talking your basic: DHCPDISCOVER packet goes to your switch, same DHCPDISCOVER packet never arrives at the DHCP server on the same switch. We know it to be the case, but, to help convince yourself that your switch is eating the packets, do networky things like create a SPAN port for the DHCP server’s port to prove to yourself that no DHCP requests are coming in. And yet, they are never prepared to do that, or even to propose it. So instead indirect proxies are used to draw the conclusion.
I’ve been involved in three or four such debugging sessions. They take hours. I took notes when it happened again this weekend. I guess this setup is pretty typical of how it plays out. A data center was moved, including a DHCP server. The new data center has a MAN network to the old one. All IPs were preserved. When they turned on the moved DHCP server, DHCP leases were no longer getting handed out. In fact it was worse than that. With the moved DHCP server turned off, most DHCP leases were working. But with it on, that’s when things really began to go south!
Here’s the switch port they noted for the iDRAC:
sh run int gi1/0/24
Building configuration…
Current configuration : 233 bytes
!
interface GigabitEthernet1/0/24
 description --- To-cnshis01-iDRAC - iDRAC
 switchport access vlan 202
 switchport mode access
 logging event link-status
 speed 100
 duplex full
 spanning-tree portfast
 ip dhcp snooping trust
end
The first line is, of course, the IOS command. OK, so they had ip dhcp snooping trust on the iDRAC port, right. But on the actual server port they had this:
sh run int gi1/0/23
Building configuration…
Current configuration : 258 bytes
!
interface GigabitEthernet1/0/23
 description --- To-cnshis01-Gb1 - Gb1
 switchport access vlan 202
 switchport mode access
 logging event link-status
 spanning-tree portfast
 service-policy input PMAP_COS_REMARK_IN
 service-policy output PMAP_COS_OUT
end
I basically told them cheekily up front that this is usually a network switch problem and that they have to play with the DHCP snooping enable setting.
And I have to say that the usual hours of debugging were short-circuited this time as they seemed to believe me, and simply experimented by adding
ip dhcp snooping trust
to the DHCP server’s main port. We immediately began seeing DHCPDISCOVER packets come in to the DHCP server, and the team testified that people were getting leases.
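For anyone who wants to check a capture for themselves, here is a minimal sketch in Python of how to tell whether a UDP payload is a DHCPDISCOVER. It relies only on the standard DHCP packet layout: a 236-byte BOOTP header, a 4-byte magic cookie, then options, where option 53 carries the message type and the value 1 means DISCOVER. The function names and the helper that builds a test payload are my own, for illustration only. This is not what we ran during the session.

```python
# Sketch: classify a raw DHCP (UDP) payload by its message type option.
DHCP_MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the options area
OPTION_PAD = 0
OPTION_MESSAGE_TYPE = 53
OPTION_END = 255
MESSAGE_TYPES = {
    1: "DHCPDISCOVER", 2: "DHCPOFFER", 3: "DHCPREQUEST", 4: "DHCPDECLINE",
    5: "DHCPACK", 6: "DHCPNAK", 7: "DHCPRELEASE", 8: "DHCPINFORM",
}


def dhcp_message_type(payload: bytes):
    """Return the DHCP message type name, or None if it cannot be found."""
    # The fixed BOOTP header is 236 bytes; the magic cookie follows it.
    if len(payload) < 240 or payload[236:240] != DHCP_MAGIC_COOKIE:
        return None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == OPTION_END:
            break
        if code == OPTION_PAD:  # pad is a single byte with no length field
            i += 1
            continue
        if i + 1 >= len(payload):  # truncated option, give up
            break
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == OPTION_MESSAGE_TYPE and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None


def make_test_payload(msg_type: int) -> bytes:
    """Hypothetical helper: build a bare-bones payload for testing only."""
    # op=1 (BOOTREQUEST), htype=1 (Ethernet), hlen=6, hops=0, rest zeroed
    header = bytes([1, 1, 6, 0]) + bytes(232)
    options = bytes([OPTION_MESSAGE_TYPE, 1, msg_type, OPTION_END])
    return header + DHCP_MAGIC_COOKIE + options


print(dhcp_message_type(make_test_payload(1)))  # DHCPDISCOVER
```

Feed it the UDP payloads captured on the DHCP server’s port (UDP port 67): if no payload ever comes back as DHCPDISCOVER while clients are begging for addresses, the packets are getting lost before they reach the server.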
Final mystery explained
Now why were things behaving really badly – no leases – when the DHCP server was up but no DHCPDISCOVER requests were getting to it? I have the explanation for that as well. You see, there is a standby DHCP server which is designed for failure of the primary DHCP server. But not for this type of failure! That’s right. There is an out-of-band (by that I mean not carried over DHCP ports like UDP port 67) communication between standby and primary which tells the standby: Hey, although you got this DHCPDISCOVER request, ignore it because the primary is active and will serve it! And meanwhile, as we have said, the primary wasn’t getting the requests at all. Upshot: no one gets leases.
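The logic can be boiled down to a toy model. This is a sketch of the behavior as I understand it, not the actual failover protocol of any DHCP product; the function and parameter names are made up for illustration.

```python
def who_answers(primary_running: bool,
                primary_sees_discover: bool,
                standby_sees_discover: bool):
    """Toy model: return which server answers a DHCPDISCOVER, or None."""
    # The primary answers only if it is up AND the request reaches it.
    if primary_running and primary_sees_discover:
        return "primary"
    # Assumption: the standby learns over the out-of-band channel whether
    # the primary is up, and stays silent for as long as it is.
    if standby_sees_discover and not primary_running:
        return "standby"
    return None


# Moved primary switched off: the standby takes over.
print(who_answers(False, False, True))  # standby
# Moved primary switched on, but the switch eats its DISCOVERs.
print(who_answers(True, False, True))   # None
```

The second case is the one we hit: the standby defers to a primary that is alive but deaf, so nobody answers.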
Just to mention it
My second-to-last debugging session of this sort was a little different. There they mentioned that there was a “global setting” which governed this DHCP snooping on the switch. So they had to do something with that (enable or disable or something). So there was no issue with the individual switch ports. For me that’s just a variation on the same theme.
What’s the idea behind this feature?
Having done a total of zero minutes of research on the topic, I will anyway weigh in with my opinion! Suppose someone comes along and plugs a consumer-grade home router into your network. It’s probably going to act as a rogue DHCP server. Imagine the fun trying to debug that situation? We’ve all been there… These rogue devices are probably fairly common. So if your corporate switch doesn’t suppress certain DHCP packets from ports where they are not expected, then this rogue device will begin to take down your subnet and totally bewilder everyone. I imagine the setting that is the topic of this blog post stems from trying to suppress all unknown DHCP packets in advance. It’s just that sometimes the setting is taken too far and, e.g., a firewall which relays DHCP requests is also getting its DHCP packets suppressed.
I normally would have presented this as part of my IT Detective series. But I feel this is more like a lament about the sad state of affairs with our network providers. And though I’ve seen this issue about four times in the past 12 months, they always act like they have no idea what we’re talking about. They’ve never encountered this problem. They have no idea how to fix it. And they have no idea how to further debug it. What steps does the customer wish?
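To make that guess concrete, here is a deliberately simplified model of a snooping switch. The two rules are assumptions: the first is the rogue-server protection described above, the second is inferred from the behavior seen in this post (DISCOVERs only reached the server once its port was trusted). It treats every DHCP message as a broadcast and is not a description of how any vendor implements the feature. The client port gi1/0/10 is hypothetical.

```python
SERVER_MESSAGES = {"DHCPOFFER", "DHCPACK", "DHCPNAK"}


def forward_to(message_type, ingress_port, ports, trusted):
    """Toy model: list the ports a DHCP message is forwarded out of."""
    others = [p for p in ports if p != ingress_port]
    if message_type in SERVER_MESSAGES:
        # Rule 1: server replies are accepted only from trusted ports.
        return others if ingress_port in trusted else []
    # Rule 2 (assumed): client messages are sent out trusted ports only.
    return [p for p in others if p in trusted]


ports = ["gi1/0/10", "gi1/0/23", "gi1/0/24"]

# Before the fix: only the iDRAC port is trusted, so the server on
# gi1/0/23 never sees the client's DISCOVER.
print(forward_to("DHCPDISCOVER", "gi1/0/10", ports, {"gi1/0/24"}))
# ['gi1/0/24']

# After adding trust to gi1/0/23, the DISCOVER reaches the server.
print(forward_to("DHCPDISCOVER", "gi1/0/10", ports, {"gi1/0/23", "gi1/0/24"}))
# ['gi1/0/23', 'gi1/0/24']

# A rogue home router on an untrusted port has its offers dropped.
print(forward_to("DHCPOFFER", "gi1/0/10", ports, {"gi1/0/23", "gi1/0/24"}))
# []
```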
References and related
Just because I mentioned it, here’s one of those IT Detective Agency blog posts: The IT Detective Agency: web site not accessible