The IT Detective Agency: The mystery of the intermittent site-wide Internet outage

Intro

In this mystery I will give you plenty of red herrings to recreate the confusion that occurs during problem reporting.

The details

The site, let’s call it FP, had users with Internet problems. Someone reported that all users were affected. Incident Management was brought in to coordinate troubleshooting. Various side chats ensued amongst the IT staff.

Initially I was brought in to check our SASE solution. I didn’t notice issues there.

With my Visibility hat on I checked the ThousandEyes agents we have on site, both an Enterprise Agent and an Endpoint Agent. Both looked OK. That was an important first step for me: I knew the outage was not total. The paths hadn’t changed so the tunnel to our SASE provider was still up.

So what then?

We had the wireless guy trying to isolate the problem to a wired versus wireless situation. Someone with a good head on their shoulders noticed the affected user (who was nicely cooperative) was not getting an assigned IP address. This same user had resorted to using his personal hotspot in order to work.

So I put on my dhcp hat and checked the MAC address against the dhcp server log, which is very detailed and lists every single dhcp transaction with MAC addresses and dhcp options. I did not see any requests from that MAC.
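For what it's worth, here is a rough sketch of that kind of log check. The log lines are made up for illustration (I am not reproducing our server's actual format), and `normalize_mac` and `lines_for_mac` are hypothetical helper names. The one real point is that the same MAC can be written with colons, hyphens, or dots, so it pays to normalize before comparing.

```python
import re


def normalize_mac(mac: str) -> str:
    """Reduce a MAC address to 12 lowercase hex digits, whatever the separator."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


# Matches colon-, hyphen-, or dot-separated MAC addresses inside a log line.
MAC_PATTERN = re.compile(
    r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"
    r"|\b(?:[0-9A-Fa-f]{4}\.){2}[0-9A-Fa-f]{4}\b"
)


def lines_for_mac(log_lines, mac):
    """Return every log line that mentions the given MAC in any common notation."""
    target = normalize_mac(mac)
    return [
        line
        for line in log_lines
        if any(normalize_mac(found) == target for found in MAC_PATTERN.findall(line))
    ]


# Invented sample lines, only to show the idea.
sample_log = [
    "DHCPDISCOVER from 00:11:22:aa:bb:cc via eth0",
    "DHCPACK on 10.0.0.5 to 00-11-22-AA-BB-CC",
    "DHCPREQUEST from de:ad:be:ef:00:01",
]

hits = lines_for_mac(sample_log, "0011.22aa.bbcc")
```

An empty result for a user who is actively trying to connect is the interesting case: it suggests the requests never reached the server at all.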

Well, one provider, let’s call them PC, has responsibility for the network and the dhcp service, although they often don’t admit the latter. That’s convenient, right, the problem has to be one or the other?

Then I drove into work to test things firsthand and by that time the problem was resolved.

The root cause

So what was it? Well, the dhcp server had been virtualized a few months ago. That all went well. Some savvy vendor management-type person from our staff learned that the VM was transferred to a different ESX host this morning. A different vendor is responsible for the ESX servers, by the way. They have an overarching project to get all VMs onto the newer hardware they had installed.

So, anyway, although the switch ports for the original ESX host were all properly configured with DHCP snooping enabled (or whatever the correct dhcp settings need to be, you know what I mean), on the new ESX switch ports this hadn’t been done. So the dhcp protocol traffic was getting dropped by the switch port. Normally that’s a nice security feature to prevent rogue dhcp servers from replying to queries. But in this case it was fatal and confusing.
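To make the snooping idea concrete, here is a toy model in Python. It is a deliberate simplification and not any vendor's implementation: the port names and the `snooping_allows` function are my own inventions, and real switches do more (rate limiting, binding tables and so on). The message type names come from the DHCP protocol itself.

```python
from dataclasses import dataclass

# DHCP message types that only a server should be sending.
SERVER_MESSAGES = {"DHCPOFFER", "DHCPACK", "DHCPNAK"}


@dataclass
class SwitchPort:
    name: str
    trusted: bool = False  # in this model a port is untrusted until someone configures it


def snooping_allows(port: SwitchPort, message_type: str) -> bool:
    """Return True if a DHCP message arriving on this port would be forwarded."""
    if message_type in SERVER_MESSAGES:
        # Server replies are only accepted from ports marked as trusted.
        return port.trusted
    # Client messages (DHCPDISCOVER, DHCPREQUEST, ...) pass in this simplified model.
    return True


# Hypothetical port names standing in for the old and new ESX uplinks.
old_port = SwitchPort("old-esx-uplink", trusted=True)
new_port = SwitchPort("new-esx-uplink")

old_ok = snooping_allows(old_port, "DHCPOFFER")
new_ok = snooping_allows(new_port, "DHCPOFFER")
```

In this model the server behind the new port can hear clients but its offers are thrown away, which is exactly what you want for a rogue server and exactly what you don't want for the real one. Some implementations also forward client broadcasts only toward trusted ports, which would fit the empty server log, but that is my assumption about this particular switch.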

PC modified the switch port to reflect that this server now runs an active dhcp service and all was good.

Case: closed.

Conclusions

Whom do you trust? Every IT person has a built-in filter to assign a certain level of distrust to anyone else’s assertion based mainly on their reputation. The Servicedesk reports that “someone says” all users at the site have lost Internet access – that’s a low-trust assertion.

Now if your reputable networking colleague says they can’t access Internet, trust that but get more information. Is that Internet + Intranet? Do basic things like Outlook and Teams work?

At a large site, blanket statements are hard for any one individual to make because no one is really in a position to know about a total outage.

Everyone loves to assign blame as quickly as possible during troubleshooting. That’s often a lot harder than it looks and blanket statements not carefully thought through can lead you down the wrong path.

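If a user runs ipconfig /all and their IP is a 169.254… address, the machine assigned that address to itself because it never got a lease from a dhcp server. Effectively they don’t have an IP at all.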
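If you are collecting addresses from several users, the same check is easy to script. This is a minimal sketch using Python's standard `ipaddress` module; the function name `looks_like_failed_dhcp` is just something I made up.

```python
import ipaddress


def looks_like_failed_dhcp(address: str) -> bool:
    """True if this is an IPv4 link-local (169.254.0.0/16) self-assigned address."""
    ip = ipaddress.ip_address(address)
    # IPv6 link-local addresses are normal, so only flag the IPv4 case.
    return isinstance(ip, ipaddress.IPv4Address) and ip.is_link_local


self_assigned = looks_like_failed_dhcp("169.254.10.20")
normal_lease = looks_like_failed_dhcp("10.1.2.3")
```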

DHCP leases are long-lasting so some people will work off their previous valid lease for a while. Thus this kind of outage will not typically affect everyone. Servers such as our EA (the ThousandEyes Enterprise Agent) will have static IPs and not be impacted by a dhcp problem.
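The arithmetic behind that is simple enough to sketch. By default a DHCP client tries to renew at 50% of the lease time (T1), falls back to rebinding at 87.5% (T2), and only gives up the address at expiry. The 8-day lease and the dates below are assumptions for illustration, not our actual settings, and the function names are my own.

```python
from datetime import datetime, timedelta


def lease_timeline(obtained: datetime, lease: timedelta) -> dict:
    """Key moments in a lease, using the default DHCP timer ratios."""
    return {
        "renew": obtained + lease * 0.5,     # T1: first renewal attempt
        "rebind": obtained + lease * 0.875,  # T2: client tries any server
        "expires": obtained + lease,         # address must be given up
    }


def keeps_address_through(obtained: datetime, lease: timedelta, outage_end: datetime) -> bool:
    """True if a client that stays connected still holds a valid lease when the outage ends."""
    return obtained + lease > outage_end


# Assumed values: a lease obtained on 1 January with an 8-day lease time.
timeline = lease_timeline(datetime(2024, 1, 1), timedelta(days=8))
```

So a client that renewed recently can sail through a dhcp outage of hours or even days without noticing, while a laptop that just arrived on site gets nothing. That is why the reports looked so inconsistent.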

I really feel we got lucky by having full cooperation of all vendors and personnel, and having bright people on staff who cut through the red herrings in order to quickly find the root cause. I could easily see this issue lingering for many more hours under only slightly different conditions.

