Categories
Admin Network Technologies

What is the one DHCP problem managed network providers never recognize?

Answer: the one where their switch eats the DHCPDISCOVER packets. And the amazing thing is they never learn. And the second amazing thing is that they actually don’t apply the most basic networking debugging techniques when such a problem occurs. I’m talking your basic, DHCPDISCOVER packet goes to your switch, same DHCPDISCOVER packet never arrives to the DHCP server on same switch. We know it to be the case, but, to help convince yourself that your switch is eating the packets, do networky things like create a span port of the DHCP server’s port to prove to yourself that no DHCP requests are coming in. And yet, they are never prepared to do that, to propose that. So instead indirect proxies are used to draw the conclusion.

I’ve been involved in three of four such debugging sessions. They take hours. I took notes when it happened again this weekend. I guess that setup is pretty typical of how it plays out. A data center was moved, including a DHCP server. The new data center has a MAN network to the old one. All IPs were preserved. When they turned on the moved DHCP server DHCP lease were no longer getting handed out. In fact it was worse than that. With the moved DHCP sever turned off, most DHCP leases were working. But with it on, that’s when things really began to go south!

Here’s the switch port they noted for the iDRAC:

sh run int gi1/0/24
Building configuration…
Current configuration : 233 bytes
!
interface GigabitEthernet1/0/24
description --- To-cnshis01-iDRAC - iDRAC
switchport access vlan 202
switchport mode access
logging event link-status
speed 100
duplex full
spanning-tree portfast
ip dhcp snooping trust
end

The first line of course if the IOS command. OK, so they had that on the iDRAC, right. But on the actual server port they had this:

sh run int gi1/0/23
Building configuration…
Current configuration : 258 bytes
!
interface GigabitEthernet1/0/23
description --- To-cnshis01-Gb1 - Gb1
switchport access vlan 202
switchport mode access
logging event link-status
spanning-tree portfast
service-policy input PMAP_COS_REMARK_IN
service-policy output PMAP_COS_OUT
end

I basically told them cheekily up front that this is usually a network switch problem and that they have to play with the DHCP snooping enable setting.

And I have to say that the usual hours of debugging were short-circuited this time as they seemed to believe me, and simply experimented by adding

ip dhcp snooping trust

to the DHCP server’s main port. We immediately began seeing DHCPDISCOVER pakcets come in to the DHCP server, and the team testified that people were getting leases.

Final mystery explained

Now why were things behaving really badly – no leases – when the DHCP server was up but no DHCPDISCOVER requests were getting to it? I have the explanation for that as well. You see there is a standby DHCP server which is designed for failure of the primary DHCP server. But not for this type of failure! That’s right. There is an out-of-band (by that I mean not carried over DHCP ports like UDP port 67) communication between standby and primary which tells the standby Hey, although you got this DHCPDISCOVER request, ignore it becasue the primary is active and will serve it! And meanwhile, as we have said, the primary wasn’t getting the requests at all. Upshot: no one gets leases.

Just to mention it

My second-to-last debugging session of this sort was a little different. There they mentioned that there was a “global setting” which governed this DHCP snooping on the switch. So they had to do something with that (enable or disable or something). So there was no issue with the individual switch ports. For me that’s just a variation on the same theme.

2023 debugging session

Well, nothing has really changed two years after I originslly posted this article and I get on these troubleshooting sessions with the vendor, people from the firewall team, vendor management people – it’s quite an affair. And as before it always takes a minimum of a couple hours for the network vendor to find their mistakes in their configuration. I have just done two of these sessions in the last week.

The last one was a tad different. There was a firewalled segment. The PCs behind it were not receiving IPs from the dhcp server. It is worth mentioning. Someone suggested a traceroute from the dhcp server to this subnet. And then a traceroute to another subnet (non-firewalled) at the same site which was working. They looked completely different after the first few hops! So about an hour after that they found that they had forgotten to add a route for this subnet pointing to the firewall. And that makes sense in that on the dhcp server – unlike in most cases – I was see the DHCP DISCOVER and it was replying with a DHCP OFFER. But that DHCP OFFER was simply not getting to the firewall at the site.

Cute Mnemonic to remember the four DHCP phases

Can you never remember the phases of the DHCP protocol like me? Then remember only this: DORA.

  • Discover
  • Offer
  • Request
  • Ack

What’s the idea behind this feature?

Having done a total of zero minutes of research on the topic, I will anyway weigh in with my opinion! Suppose someone comes along and plugs in a consumer grade home router into your network. It’s probably going to act as a rogue DHCP server. Imagine the fun trying to debug that situation? We’ve all been there… These rogue devices are probably fairly common. So if your corporate switch doesn’t suppress certain DHCP packets from ports where they are not expected, then this rogue device will begin to take down your subnet and totally bewilder everyone. I imagine this setting that is the topic of this blog post stems from trying to suppress all unknown DHCP packets in advance. Its just that sometimes the setting is taken too far and, e.g., a firewall which relays DHCP requests is also getting its DHCP packets suppressed.

Conclusion

I normally would have presented this as part of my IT Detective series. But I feel this is more like a lament about the sad state of affairs with our network providers. And though I’ve seen this issue about four times in the past 12 months, they always act like they have no idea what we’re talking about. They’ve never encountered this problem. They have no idea how to fix it. And they have no idea how to further debug it.. What steps does the customer wish?

References and related

Juat because I mentioned it, here’s on of those IT Detective Agency blog posts: The IT Detecive Agency: web site not accessible

Categories
Consumer Interest Consumer Tech Network Technologies Raspberry Pi

Consumer Tech: Home Internet stopped working

Intro

We woke up yesterday to no Internet. The usual remedies consumers go through did nothing to resolve the issue. What to do?

The details – November 25, 2020

The usual restarts or my router and the cable modem did not work. I plugged in my work laptop directly to the cable modem for some quick tests but that did not work.

I plugged my work-issued VPN router directly to the cable modem and it did not pick up an IP and re-establish the tunnel.

When I logged into my router I saw that its WAN IP was listed as 0.0.0.0, which means none at all.

I called the ISP twice. Both time they said they could “see” my modem, and they tried to restart it on their end, but that did not seem to do anything at all, based on the constant status LEDs (see picture below). I got my service visit moved up from Dec 11th to Dec 2nd, but still that would mean a week without Internet – not so great when three people are relying on it for their work.

I rebooted the cable modem a couple times at least. Nothing changed.

Then I started some research on quickie alternatives. Ask a friend from work for a spare Cradlepoint air card? They’re already out on vacation. Get a Chinese-made unlocked hotspot with pre-purchased data? Seems fishy, and ultimately expensive. Verizon brand hotspot? We had a borrowed one. Very finicky. And no ethernet ports.

Raspberry Pi + DIY approach?

At one point in the evening, convinced I would have to wait days for for a visit from the cable guy, I rigged up a spare Raspberry Pi to act as a router between a mobile hotspot (a companion tablet to a Verizon phone) and my Linksys router. Why bother? Why not just use the hotspot directly? Mostly because it’s a pain in the rear to reprogram all those Internet of Things devices one has in ones home these days, notably the several Echo Dots, but as well, a wireless printer, a few laptops, Firesticks, tablets, etc. With this approach I keep the WiFi SSID as it was for all those devices. And, it sort of worked! At least I got one Echo Dot to work. I didn’t push my luck. This stuff consumes a lot of data, even when “idle.”

To be continued…

Linksys WRT1200AC status lights – when healthy!
Cable Modem tatus lights – when operating normally

But I am pretty good at troubleshooting. What I know that less experienced people may not is that all the testing I’ve done to that point was not ironclad proof of failure of the cable modem. I know the traditional advice of old is to hook up a laptop directly to the ethernet port and work with it that way. Furthermore the cable company support said that my status lights were reading normally. So, when I tested my work laptop? Are you kidding? That thing has so many problems when I switch between SSIDs due to some new security software – it loves to display the Globe in the system tray, and the only recourse is to reboot. That’s what I was seeing, but notice I said a quickie test? I did not have time to do that reboot and all that. And that work-issued VPN router? I don’t know how that thing really works either. Never having set it up that way I did not trust reading too much into its results (which was essentially an orange status light instead of the usual white).

So when I had more time in the evening, I hooked up a home laptop which I know should work. After a cable modem reboot in fact I did get an IP and could surf the Internet. That was a glimmer of hope. So I put my router back in place. Still it did not pick up an WAN IP address. Still reading 0.0.0.0 for its IP.

Then I put the laptop back, writing down the IP, subnet mask and default gateway. Then I put my router back, switched its WAN mode from DHCP to fixed IP, putting on the exact IP address the laptop had picked up, with correct subnet mask and default gateway. Still it was not working. When the router is not working the WAN status light is sort of orange-ish. It’s white (pictured above) when the WAN link is communicating.

I decided the fault should lie more with my router than anywhere else, and since it wasn’t working and no number of power cycles was changing that situation, I decided that a factory reset is the thing to try. The last thing I could try. I noted the exact name and passwords of my SSIDs, held the reset button for 15 seconds until the status lights flicked out, and let it start up. It went through a start-up process, which i saw after connecting to its default IP of 192.168.1.1. It was clear it was not seeing the cable modem at the point where it should, but it had some very specific advice to try: power off cable modem, wait two minutes, power it back on, and then it would try again. And that did work! Yeah!

What may have precipitated this

My local cable company was recently bought by a much bigger company. I know for a fact what my WAN IP used to be, and I see it has changed. They now draw from a giant pool of IPs – a /14 in CIDR notation – that’s 262,000 addresses – that belongs to the new owner. So I believe the problem occurred due to a poor implementation of the dhcp protocol within my router, or a poor interplay between my router’s DHCP client and the ISP’s DHCP server. But I can’t research that line of troubleshooting because the ISP’s DHCP policies would require a lot of time-consuming experimentation on my part to reverse engineer based on observed behaviour under different conditions. And I would need an open source DHCP client – but I have the Raspberry Pi running dnsmasq for that, so that end could gather all the needed client information.

Prior to this acquisition I would tend to keep the same WAN IP for years – that’s how stable it was.

Another approach

Very germane to this topic is the fact that my neighbor down the street experienced his own Internet outage the day after I did! His solution was to buy a better cable modem. I did not know you could do that – I thought they were proprietary. He also saw his router with the 0.0.0.0 WAN address. And his approach also worked. This makes me less sure my router was really at fault – maybe Altice screwed up their DHCP service for half a day.

Conclusion

Unusual for me, I’m going to write the conclusion before writing the tedious part which is the full explanation in the middle.

By the end of the day I got the Internet working. After isolating the problem to my home router, the Linksys WRT1200AC, and determining that any amount of power cycling was not clearing things up, a factory reset did the trick! The cable modem and my cable Internet service was fine all along.

References and related

How to turn your Raspberry Pi into a router which shares your hotspot with your home router.

The Linksys WRT1200AC is no longer sold. It looks like the newer version is the WRT1900AC – it even looks identical. It’s a good router. I know there are fancier solutions out there, but there are also worse ones as well, so I can only give my qualified endorsement: https://www.amazon.com/Linksys-AC1900-Source-Wireless-WRT1900AC/dp/B014MIBLSA/ref=sr_1_1?dchild=1&keywords=linksys+wrt1200ac&qid=1606519765&sr=8-1

DHCP and CIDR notation are both described in great detail in their respective Wikipedia articles.

Categories
Admin Perl

Counting active leases on an old ISC DHCP server

Intro
Checkpoint Gaia offers a DHCP service, but it ias based on a crude and old dhcp daemon implementation frmo ISC. Doesn’t give you much. Mostly just the file /var/lib/dhcpd/dhcpd.leases, which it constantly updates. A typical dhcp client entry looks like this:

 
lease 10.24.69.22 {
  starts 5 2018/11/16 22:32:59;
  ends 6 2018/11/17 06:32:59;
  binding state active;
  next binding state free;
  hardware ethernet 30:d9:d9:20:ca:4f;
  uid "\0010\331\331 \312O";
  client-hostname "KeNoiPhone";
}


The details

So I modified a perl script to take all those lines and make sense of them.
I called it lease-examine.pl.
Here it is

#!/usr/bin/perl
# from https://askubuntu.com/questions/219609/how-do-i-show-active-dhcp-leases - DrJ 11/15/18
 
my $VERSION=0.03;
 
##my $leases_file = "/var/lib/dhcpd/dhcpd.leases";
my $leases_file = "/tmp/dhcpd.leases";
 
##use strict;
use Date::Parse;
 
my $now = time;
##print $now;
##exit;
# 12:22 PM 11/15/18 EST
#my $now = "1542302555";
my %seen;       # leases file has dupes (because logging failover stuff?). This hash will get rid of them.
 
open(L, $leases_file) or die "Cant open $leases_file : $!\n";
undef $/;
my @records = split /^lease\s+([\d\.]+)\s*\{/m, <L>;
shift @records; # remove stuff before first "lease" block
 
## process 2 array elements at a time: ip and data
foreach my $i (0 .. $#records) {
    next if $i % 2;
    ($ip, $_) = @records[$i, $i+1];
    ($ip, $_) = @records[$i, $i+1];
 
    s/^\n+//;     # && warn "leading spaces removed\n";
    s/[\s\}]+$//; # && warn "trailing junk removed\n";
 
    my ($s) = /^\s* starts \s+ \d+ \s+ (.*?);/xm;
    my ($e) = /^\s* ends   \s+ \d+ \s+ (.*?);/xm;
 
    ##my $start = str2time($s);
    ##my $end   = str2time($e);
    my $start = str2time($s,UTC);
    my $end   = str2time($e,UTC);
 
    my %h; # to hold values we want
 
    foreach my $rx ('binding', 'hardware', 'client-hostname') {
        my ($val) = /^\s*$rx.*?(\S+);/sm;
        $h{$rx} = $val;
    }
 
    my $formatted_output;
 
    if ($end && $end < $now) {
        $formatted_output =
            sprintf "%-15s : %-26s "              . "%19s "         . "%9s "     . "%24s    "              . "%24s\n",
                    $ip,     $h{'client-hostname'}, ""              , $h{binding}, "expired"               , scalar(localti
me $end);
    }
    else {
        $formatted_output =
            sprintf "%-15s : %-26s "              . "%19s "         . "%9s "     . "%24s -- "              . "%24s\n",
                    $ip,     $h{'client-hostname'}, "($h{hardware})", $h{binding}, scalar(localtime $start), scalar(localti
me $end);
    }
 
    next if $seen{$formatted_output};
    $seen{$formatted_output}++;
    print $formatted_output;
}

Even that script produces a thicket of confusing information. So then I further process it. I call this script dhcp-check.sh:

#!/bin/sh
# DrJ 11/15/18
# bring over current dhcp lease file from firewall FW-1
date
echo fetching lease file dhcpd.leases
scp admin@FW-1:/var/lib/dhcpd/dhcpd.leases /tmp
# analyze it. this should show us active leases
echo analyze dhcpd.leases
DIR=`dirname $0`
$DIR/lease-examine.pl|grep active|grep -v expired > /tmp/intermed-results
# intermed-results looks like:
#10.24.76.124   : "android-7fe22a415ce21c55" (50:92:b9:b8:92:a0)    active Thu Nov 15 11:32:13 2018 -- Thu Nov 15 15:32:13 2018
#10.24.76.197   : "android-283a4cb47edf3b8c" (98:39:8e:a6:4f:15)    active Thu Nov 15 11:37:23 2018 -- Thu Nov 15 15:32:14 2018
#10.24.70.236   : "other-Phone"            (38:25:6b:79:31:60)    active Thu Nov 15 11:32:24 2018 -- Thu Nov 15 15:32:24 2018
#10.24.74.133   : "iPhone-de-Lucia"          (34:08:bc:51:0b:ae)    active Thu Nov 15 07:32:26 2018 -- Thu Nov 15 15:32:26 2018
#exit
# further processing. remove the many duplicate lines
echo count active leases
awk '{print $1}' /tmp/intermed-results|sort -u|wc -l > /tmp/dhcp-active-count
echo count is `cat /tmp/dhcp-active-count`

And that script gives my what I believe is an accurate count of the active leases. I run it every 10 minutes from SiteScope and voila, we have a way to make sure we’re coming close to running out of IP addresses.

Categories
DNS

The IT detective agency: rogue IPv6 device messes up DHCP for entire subnet

Intro
This was a fascinating case insofar as it was my first encounter with a real life IPv6 application. So it was trial by fire.

The details
I think the title of the post makes clear what happened. The site people were saying they can ping hosts by IP but not by DNS name. So basically nothing was working. I asked them to do an ipconfig /all and send me the output. At the top of the list of DNS servers was this funny entry:

IPv6 DNS server shows up first

I asked them to run nslookup, and sure enough, it timed out trying to talk to that same IPv6 server. Yet they could PING it.

The DNS servers listed below the IPv6 one were the expected IPv4 our enterprise system hands out.

My quick conclusion: there is a rogue host on their subnet acting as an IPv6 DHCP server! It took some convincing on my part before they got on board with that idea.

But I goofed too. In my haste to move on, I confused an IPv6 address with a MAC address. Rookie IPv6 mistake I suppose. It looked strange, had letters and even colons, so it kind of looks like a MAC address, right? So I gave some quick advice to get rid of the problem: identify this address on the switch, find its port and disable it. So the guy looked for this funny MAC address and of course didn’t find it or anything that looked like it.

My general idea was right – there was a rogue IPv6 DHCP server.

My hypothesis as to what happened
The PCs have both an IPv4 as well as an IPv6 stack, as does just about everyone’s PC. These stacks run independently of each other. Everyone blissfully ignores the IPv6 communication, but that doesn’t mean it’s not occurring. I think these PCs got an IPv4 IP and DNS servers assigned to them in the usual way. All good. Then along came a DHCPv6 server and the PC’s IPv6 stack sent out a DHCPv6 request to the entire subnet (which it probably is doing periodically all along, there just was never a DHCPv6 server answering before this). This time the DHCPv6 server answered and gave out some IPv6-relevant information, including a IPv6 DNS server.

I further hypothesize that what I said above about the IPv4 and IPv6 stacks being independent is not entirely true. These stacks are joined in one place: the resolving nameservers. You only get one set of resolving namesevrers for your combined IPv4/IPv6 stacks, which sort of makes sense because DNS servers can answer queries about IPv6 objects if they are so configured. So, anyway, the DHCPv6 client decides to put the DNS server it has learned about from its DHCPv6 server at the front of the existing nameserver list. This nameserver is totally busted, however and sits on the request and the client’s error handling isn’t good enough to detect the problem and move on to the next nameserver in the list – an IPv4 nameserver which would have worked just great – despite the fact that it is designed to do just that. And all resolution breaks and breaks badly.

What was the offending device? They’re not saying, except we heard it was a router, hence, a host introduced by the LAN vendor who can’t or won’t admit to having made such an error, instead making a quiet correction. Quiet because of course they initially refused the incident and had us look elsewhere for the source of the “DHCP problem.”

Alternate theory
I see that IPv6 devices do not need to get DNS servers via DHCPv6. They can use a new protocol, NDP, neighbor discovery protocol. Maybe the IPv6 stack is periodically trying NDP and finally got a response from the rogue device and put that first on the list of nameservers. No DHCPv6 really used in that scenario, just NDP.

Useful tips for layer 2 stuff
Here’s how you can find the MAC of an IPv6 device which you have just PINGed:

netsh interface ipv6 show neighbors

from a CMD prompt on a Windows machine.

In Linux it’s

ip ‐6 neigh show

Conclusion
Another tough case resolved! We learn some valuable things about IPv6 in the process.

References and related
I found the relevant commands in this article: https://www.midnightfreddie.com/how-to-arp-a-in-ipv6.html