
Consumer Tech: Home Internet stopped working

Intro

We woke up yesterday to no Internet. The usual remedies consumers go through did nothing to resolve the issue. What to do?

The details – November 25, 2020

The usual restarts of my router and the cable modem did not work. I plugged my work laptop directly into the cable modem for some quick tests, but that did not work either.

I plugged my work-issued VPN router directly into the cable modem, and it did not pick up an IP or re-establish the tunnel.

When I logged into my router I saw that its WAN IP was listed as 0.0.0.0, which means none at all.

I called the ISP twice. Both times they said they could “see” my modem, and they tried to restart it on their end, but that did not seem to do anything at all, based on the unchanging status LEDs (see picture below). I got my service visit moved up from Dec 11th to Dec 2nd, but still that would mean a week without Internet – not so great when three people are relying on it for their work.

I rebooted the cable modem a couple times at least. Nothing changed.

Then I started some research on quickie alternatives. Ask a friend from work for a spare Cradlepoint air card? They’re already out on vacation. Get a Chinese-made unlocked hotspot with pre-purchased data? Seems fishy, and ultimately expensive. Verizon brand hotspot? We had a borrowed one. Very finicky. And no ethernet ports.

Raspberry Pi + DIY approach?

At one point in the evening, convinced I would have to wait days for a visit from the cable guy, I rigged up a spare Raspberry Pi to act as a router between a mobile hotspot (a companion tablet to a Verizon phone) and my Linksys router. Why bother? Why not just use the hotspot directly? Mostly because it’s a pain in the rear to reprogram all those Internet of Things devices one has in one’s home these days, notably the several Echo Dots, but also a wireless printer, a few laptops, Firesticks, tablets, etc. With this approach I keep the WiFi SSID as it was for all those devices. And, it sort of worked! At least I got one Echo Dot to work. I didn’t push my luck. This stuff consumes a lot of data, even when “idle.”

To be continued…

Linksys WRT1200AC status lights – when healthy!
Cable Modem status lights – when operating normally

But I am pretty good at troubleshooting. What I know that less experienced people may not is that all the testing I’d done to that point was not ironclad proof of failure of the cable modem. I know the traditional advice is to hook up a laptop directly to the ethernet port and work with it that way. Furthermore, the cable company support said that my status lights were reading normally. So, when I tested my work laptop? Are you kidding? That thing has so many problems when I switch between SSIDs due to some new security software – it loves to display the Globe in the system tray, and the only recourse is to reboot. That’s what I was seeing, but notice I said a quickie test? I did not have time to do that reboot and all that. And that work-issued VPN router? I don’t know how that thing really works either. Never having set it up that way, I did not trust reading too much into its results (which amounted to an orange status light instead of the usual white).

So when I had more time in the evening, I hooked up a home laptop which I knew should work. After a cable modem reboot I did in fact get an IP and could surf the Internet. That was a glimmer of hope. So I put my router back in place. Still it did not pick up a WAN IP address. Still reading 0.0.0.0 for its IP.

Then I put the laptop back, writing down the IP, subnet mask and default gateway. Then I put my router back, switched its WAN mode from DHCP to fixed IP, putting on the exact IP address the laptop had picked up, with correct subnet mask and default gateway. Still it was not working. When the router is not working the WAN status light is sort of orange-ish. It’s white (pictured above) when the WAN link is communicating.
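When copying lease values by hand into a static configuration it is easy to mistype one of them. Here is a minimal Python sketch of a sanity check: the default gateway has to sit inside the network implied by the IP and subnet mask. The function name is my own invention and the addresses are placeholders from the documentation ranges, not the ones I actually used.

```python
import ipaddress

def static_config_is_consistent(ip: str, netmask: str, gateway: str) -> bool:
    """Return True if the gateway lies in the subnet defined by ip/netmask."""
    # strict=False lets us pass a host address rather than the network address
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    return ipaddress.ip_address(gateway) in network

# Placeholder values (documentation ranges), not real ISP addresses
print(static_config_is_consistent("203.0.113.57", "255.255.255.0", "203.0.113.1"))
print(static_config_is_consistent("203.0.113.57", "255.255.255.0", "198.51.100.1"))
```

Passing this check only rules out a typo. It says nothing about whether the ISP will honor an address that was configured statically rather than leased.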

I decided the fault more likely lay with my router than anywhere else, and since it wasn’t working and no number of power cycles was changing that situation, I decided that a factory reset was the thing to try. The last thing I could try. I noted the exact names and passwords of my SSIDs, held the reset button for 15 seconds until the status lights flicked out, and let it start up. It went through a start-up process, which I saw after connecting to its default IP of 192.168.1.1. It was clear it was not seeing the cable modem at the point where it should, but it had some very specific advice to try: power off the cable modem, wait two minutes, power it back on, and then it would try again. And that did work! Yeah!

What may have precipitated this

My local cable company was recently bought by a much bigger company. I know for a fact what my WAN IP used to be, and I see it has changed. They now draw from a giant pool of IPs – a /14 in CIDR notation, which is about 262,000 addresses – that belongs to the new owner. So I believe the problem occurred due to a poor implementation of the DHCP protocol within my router, or a poor interplay between my router’s DHCP client and the ISP’s DHCP server. But I can’t pursue that line of troubleshooting because reverse engineering the ISP’s DHCP policies from observed behaviour under different conditions would require a lot of time-consuming experimentation on my part. And I would need an open source DHCP client – but I have the Raspberry Pi running dnsmasq for that, so that end could gather all the needed client information.
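For anyone who wants to check the /14 arithmetic: an IPv4 address has 32 bits, so a prefix of length 14 leaves 18 bits for hosts. A short Python sketch follows. The 10.0.0.0/14 network is just a placeholder, not the ISP's real block.

```python
import ipaddress

def addresses_in_prefix(prefix_length: int) -> int:
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_length)

print(addresses_in_prefix(14))  # 2**18 = 262144

# Cross-check with the standard library, using a placeholder /14 network
print(ipaddress.ip_network("10.0.0.0/14").num_addresses)
```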

Prior to this acquisition I would tend to keep the same WAN IP for years – that’s how stable it was.

Another approach

Very germane to this topic is the fact that my neighbor down the street experienced his own Internet outage the day after I did! His solution was to buy a better cable modem. I did not know you could do that – I thought they were proprietary. He also saw his router with the 0.0.0.0 WAN address. And his approach also worked. This makes me less sure my router was really at fault – maybe Altice screwed up their DHCP service for half a day.

Conclusion

Unusual for me, I’m going to write the conclusion before writing the tedious part which is the full explanation in the middle.

By the end of the day I got the Internet working. After isolating the problem to my home router, the Linksys WRT1200AC, and determining that no amount of power cycling was clearing things up, a factory reset did the trick! The cable modem and my cable Internet service were fine all along.

References and related

How to turn your Raspberry Pi into a router which shares your hotspot with your home router.

The Linksys WRT1200AC is no longer sold. It looks like the newer version is the WRT1900AC – it even looks identical. It’s a good router. I know there are fancier solutions out there, but there are also worse ones, so I can only give my qualified endorsement: https://www.amazon.com/Linksys-AC1900-Source-Wireless-WRT1900AC/dp/B014MIBLSA/ref=sr_1_1?dchild=1&keywords=linksys+wrt1200ac&qid=1606519765&sr=8-1

DHCP and CIDR notation are both described in great detail in their respective Wikipedia articles.


Counting active leases on an old ISC DHCP server

Intro
Checkpoint Gaia offers a DHCP service, but it is based on a crude and old dhcp daemon implementation from ISC. Doesn’t give you much. Mostly just the file /var/lib/dhcpd/dhcpd.leases, which it constantly updates. A typical dhcp client entry looks like this:

 
lease 10.24.69.22 {
  starts 5 2018/11/16 22:32:59;
  ends 6 2018/11/17 06:32:59;
  binding state active;
  next binding state free;
  hardware ethernet 30:d9:d9:20:ca:4f;
  uid "\0010\331\331 \312O";
  client-hostname "KeNoiPhone";
}
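
To make the fields in that entry concrete, here is a minimal sketch that pulls the interesting values out of one lease block and decides whether the lease is still active at a given moment. It is written in Python purely for illustration and is not the script I actually run. The function names are made up, and it assumes the timestamps are in UTC.

```python
import re
from datetime import datetime, timezone

SAMPLE = r'''lease 10.24.69.22 {
  starts 5 2018/11/16 22:32:59;
  ends 6 2018/11/17 06:32:59;
  binding state active;
  next binding state free;
  hardware ethernet 30:d9:d9:20:ca:4f;
  uid "\0010\331\331 \312O";
  client-hostname "KeNoiPhone";
}
'''

def parse_lease(block: str) -> dict:
    """Pull the IP, end time, binding state and MAC out of one lease block."""
    ip = re.search(r"^lease\s+([\d.]+)\s*\{", block, re.M).group(1)
    ends = re.search(r"^\s*ends\s+\d+\s+([^;]+);", block, re.M).group(1)
    state = re.search(r"^\s*binding state\s+(\w+);", block, re.M).group(1)
    mac = re.search(r"^\s*hardware ethernet\s+([0-9a-f:]+);", block, re.M).group(1)
    # Assumption: lease timestamps are UTC
    end = datetime.strptime(ends, "%Y/%m/%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {"ip": ip, "ends": end, "state": state, "mac": mac}

def is_active(lease: dict, now: datetime) -> bool:
    """A lease counts as active if it is marked active and has not yet ended."""
    return lease["state"] == "active" and lease["ends"] > now

lease = parse_lease(SAMPLE)
print(lease["ip"], lease["mac"], lease["ends"].isoformat())
print(is_active(lease, datetime(2018, 11, 17, 0, 0, tzinfo=timezone.utc)))
```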


The details

So I modified a perl script to take all those lines and make sense of them.
I called it lease-examine.pl.
Here it is

#!/usr/bin/perl
# from https://askubuntu.com/questions/219609/how-do-i-show-active-dhcp-leases - DrJ 11/15/18
 
my $VERSION=0.03;
 
##my $leases_file = "/var/lib/dhcpd/dhcpd.leases";
my $leases_file = "/tmp/dhcpd.leases";
 
##use strict;
use Date::Parse;
 
my $now = time;
##print $now;
##exit;
# 12:22 PM 11/15/18 EST
#my $now = "1542302555";
my %seen;       # leases file has dupes (because logging failover stuff?). This hash will get rid of them.
 
open(L, $leases_file) or die "Cant open $leases_file : $!\n";
undef $/;
my @records = split /^lease\s+([\d\.]+)\s*\{/m, <L>;
shift @records; # remove stuff before first "lease" block
 
## process 2 array elements at a time: ip and data
foreach my $i (0 .. $#records) {
    next if $i % 2;
    ($ip, $_) = @records[$i, $i+1];
 
    s/^\n+//;     # && warn "leading spaces removed\n";
    s/[\s\}]+$//; # && warn "trailing junk removed\n";
 
    my ($s) = /^\s* starts \s+ \d+ \s+ (.*?);/xm;
    my ($e) = /^\s* ends   \s+ \d+ \s+ (.*?);/xm;
 
    ##my $start = str2time($s);
    ##my $end   = str2time($e);
    my $start = str2time($s,'UTC');   # lease times are UTC; quote the zone so it is not a bareword
    my $end   = str2time($e,'UTC');
 
    my %h; # to hold values we want
 
    foreach my $rx ('binding', 'hardware', 'client-hostname') {
        my ($val) = /^\s*$rx.*?(\S+);/sm;
        $h{$rx} = $val;
    }
 
    my $formatted_output;
 
    if ($end && $end < $now) {
        $formatted_output =
            sprintf "%-15s : %-26s "              . "%19s "         . "%9s "     . "%24s    "              . "%24s\n",
                    $ip,     $h{'client-hostname'}, ""              , $h{binding}, "expired"               , scalar(localtime $end);
    }
    else {
        $formatted_output =
            sprintf "%-15s : %-26s "              . "%19s "         . "%9s "     . "%24s -- "              . "%24s\n",
                    $ip,     $h{'client-hostname'}, "($h{hardware})", $h{binding}, scalar(localtime $start), scalar(localtime $end);
    }
 
    next if $seen{$formatted_output};
    $seen{$formatted_output}++;
    print $formatted_output;
}

Even that script produces a thicket of confusing information. So then I further process it. I call this script dhcp-check.sh:

#!/bin/sh
# DrJ 11/15/18
# bring over current dhcp lease file from firewall FW-1
date
echo fetching lease file dhcpd.leases
scp admin@FW-1:/var/lib/dhcpd/dhcpd.leases /tmp
# analyze it. this should show us active leases
echo analyze dhcpd.leases
DIR=`dirname $0`
$DIR/lease-examine.pl|grep active|grep -v expired > /tmp/intermed-results
# intermed-results looks like:
#10.24.76.124   : "android-7fe22a415ce21c55" (50:92:b9:b8:92:a0)    active Thu Nov 15 11:32:13 2018 -- Thu Nov 15 15:32:13 2018
#10.24.76.197   : "android-283a4cb47edf3b8c" (98:39:8e:a6:4f:15)    active Thu Nov 15 11:37:23 2018 -- Thu Nov 15 15:32:14 2018
#10.24.70.236   : "other-Phone"            (38:25:6b:79:31:60)    active Thu Nov 15 11:32:24 2018 -- Thu Nov 15 15:32:24 2018
#10.24.74.133   : "iPhone-de-Lucia"          (34:08:bc:51:0b:ae)    active Thu Nov 15 07:32:26 2018 -- Thu Nov 15 15:32:26 2018
#exit
# further processing. remove the many duplicate lines
echo count active leases
awk '{print $1}' /tmp/intermed-results|sort -u|wc -l > /tmp/dhcp-active-count
echo count is `cat /tmp/dhcp-active-count`

And that script gives me what I believe is an accurate count of the active leases. I run it every 10 minutes from SiteScope and voila, we have a way to tell whether we’re coming close to running out of IP addresses.
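
The monitoring side boils down to comparing that count against the size of the DHCP pool. Here is a rough Python sketch of the comparison. The pool size, the lease count and the 90% threshold are all made-up numbers for illustration, and the function names are hypothetical. In my setup SiteScope does the thresholding.

```python
def pool_utilization(active_leases: int, pool_size: int) -> float:
    """Fraction of the DHCP pool currently handed out."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return active_leases / pool_size

def should_alert(active_leases: int, pool_size: int, threshold: float = 0.9) -> bool:
    """True once utilization reaches the alert threshold."""
    return pool_utilization(active_leases, pool_size) >= threshold

# Hypothetical numbers: 1900 active leases in a pool of 2048 addresses
print(f"{pool_utilization(1900, 2048):.1%}")
print(should_alert(1900, 2048))
```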


The IT detective agency: rogue IPv6 device messes up DHCP for entire subnet

Intro
This was a fascinating case insofar as it was my first encounter with a real life IPv6 application. So it was trial by fire.

The details
I think the title of the post makes clear what happened. The site people were saying they can ping hosts by IP but not by DNS name. So basically nothing was working. I asked them to do an ipconfig /all and send me the output. At the top of the list of DNS servers was this funny entry:

IPv6 DNS server shows up first

I asked them to run nslookup, and sure enough, it timed out trying to talk to that same IPv6 server. Yet they could PING it.

The DNS servers listed below the IPv6 one were the expected IPv4 servers our enterprise system hands out.

My quick conclusion: there is a rogue host on their subnet acting as an IPv6 DHCP server! It took some convincing on my part before they got on board with that idea.

But I goofed too. In my haste to move on, I confused an IPv6 address with a MAC address. Rookie IPv6 mistake I suppose. It looked strange, had letters and even colons, so it kind of looks like a MAC address, right? So I gave some quick advice to get rid of the problem: identify this address on the switch, find its port and disable it. So the guy looked for this funny MAC address and of course didn’t find it or anything that looked like it.
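
In hindsight the two are easy to tell apart mechanically: a MAC address is exactly six groups of two hex digits, while an IPv6 address has up to eight groups of up to four hex digits and may contain a double colon. A small Python sketch of that check, with a hypothetical function name and placeholder addresses:

```python
import ipaddress
import re

# Six groups of two hex digits, separated by colons or dashes
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}([:-][0-9A-Fa-f]{2}){5}$")

def classify(text: str) -> str:
    """Return 'mac', 'ipv6', 'ipv4' or 'unknown' for an address-like string."""
    if MAC_RE.match(text):
        return "mac"
    try:
        # Drop a zone ID such as %eth0 before parsing
        ip = ipaddress.ip_address(text.split("%")[0])
    except ValueError:
        return "unknown"
    return f"ipv{ip.version}"

print(classify("30:d9:d9:20:ca:4f"))
print(classify("fe80::32d9:d9ff:fe20:ca4f"))
print(classify("2001:db8::53"))
```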

My general idea was right – there was a rogue IPv6 DHCP server.

My hypothesis as to what happened
The PCs have both an IPv4 as well as an IPv6 stack, as does just about everyone’s PC. These stacks run independently of each other. Everyone blissfully ignores the IPv6 communication, but that doesn’t mean it’s not occurring. I think these PCs got an IPv4 IP and DNS servers assigned to them in the usual way. All good. Then along came a DHCPv6 server and the PC’s IPv6 stack sent out a DHCPv6 request to the entire subnet (which it probably had been doing periodically all along, there just was never a DHCPv6 server answering before this). This time the DHCPv6 server answered and gave out some IPv6-relevant information, including an IPv6 DNS server.

I further hypothesize that what I said above about the IPv4 and IPv6 stacks being independent is not entirely true. These stacks are joined in one place: the resolving nameservers. You only get one set of resolving nameservers for your combined IPv4/IPv6 stacks, which sort of makes sense because DNS servers can answer queries about IPv6 objects if they are so configured. So, anyway, the DHCPv6 client decides to put the DNS server it has learned about from its DHCPv6 server at the front of the existing nameserver list. This nameserver is totally busted, however, and sits on the request. The client’s error handling isn’t good enough to detect the problem and move on to the next nameserver in the list – an IPv4 nameserver which would have worked just great – despite the fact that it is designed to do just that. And all resolution breaks and breaks badly.
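
To illustrate the behaviour I would have expected from the client, here is a toy Python sketch of nameserver fallback. It is not how the Windows resolver is actually implemented. The query function is a hypothetical stand-in for a real DNS lookup, and all names and addresses are placeholders.

```python
from typing import Callable, Optional, Sequence

def resolve_with_fallback(
    name: str,
    nameservers: Sequence[str],
    query: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Ask each nameserver in order; return the first answer, or None."""
    for server in nameservers:
        try:
            answer = query(server, name)
        except TimeoutError:
            continue  # a silent server should not block the ones after it
        if answer is not None:
            return answer
    return None

# Hypothetical stand-in for a real DNS query: the IPv6 server never answers.
def fake_query(server: str, name: str) -> Optional[str]:
    if server == "2001:db8::53":
        raise TimeoutError(f"{server} did not answer")
    return "192.0.2.10"

servers = ["2001:db8::53", "192.0.2.53"]
print(resolve_with_fallback("intranet.example.com", servers, fake_query))
```

With the busted server first in the list, a well-behaved client pays one timeout and then gets its answer from the second server. What we saw looked more like the client never getting past the first entry.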

What was the offending device? They’re not saying, except we heard it was a router, hence, a host introduced by the LAN vendor who can’t or won’t admit to having made such an error, instead making a quiet correction. Quiet because of course they initially refused the incident and had us look elsewhere for the source of the “DHCP problem.”

Alternate theory
I see that IPv6 devices do not need to get DNS servers via DHCPv6. They can also learn them through NDP, the Neighbor Discovery Protocol, whose router advertisements can carry DNS server addresses. Maybe the IPv6 stack is periodically trying NDP and finally got a response from the rogue device, whose DNS server it then put first on the list of nameservers. No DHCPv6 really used in that scenario, just NDP.

Useful tips for layer 2 stuff
Here’s how you can find the MAC of an IPv6 device which you have just PINGed:

netsh interface ipv6 show neighbors

from a CMD prompt on a Windows machine.

In Linux it’s

ip -6 neigh show
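
If you have a lot of neighbors to sift through, the Linux output is easy to turn into an IPv6-to-MAC lookup table. A small Python sketch, run here against made-up sample output (the addresses are placeholders and the function name is my own):

```python
SAMPLE_OUTPUT = """\
fe80::1 dev eth0 lladdr 00:11:22:33:44:55 router REACHABLE
2001:db8::53 dev eth0 lladdr 66:77:88:99:aa:bb STALE
fe80::dead dev eth0  FAILED
"""

def neighbor_macs(output: str) -> dict:
    """Map each IPv6 neighbor to its link-layer (MAC) address."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        # Entries that never resolved (e.g. FAILED) have no lladdr field
        if "lladdr" in fields:
            table[fields[0]] = fields[fields.index("lladdr") + 1]
    return table

print(neighbor_macs(SAMPLE_OUTPUT).get("2001:db8::53"))
```

On a real machine you would feed it the live output instead, for example from subprocess.run(["ip", "-6", "neigh", "show"], capture_output=True, text=True).stdout.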

Conclusion
Another tough case resolved! We learned some valuable things about IPv6 in the process.

References and related
I found the relevant commands in this article: https://www.midnightfreddie.com/how-to-arp-a-in-ipv6.html