Categories
Admin Linux Network Technologies

The IT detective agency: The mystery of the non-validating DKIM record

Intro

A colleague of mine in another timezone created the necessary DKIM records in Cloudflare for a new mail domain. There was panic as the mail team realized too late these records were not validating. I was called in to help. Unfortunately at the beginning I only had my smartphone to work with. Did you ever try to do this kind of detail work with a smartphone? Don’t.

The details

The smartphone thing is worthy of a separate post. I was getting somewhere, but it is like working with both hands tied behind your back.

So the mail team is telling me the dkim record doesn’t validate and showing me a screenshot of something from mxtoolbox to prove it.

I of course want to know the details so I can verify my mistakes before anyone else gets to – that’s how I roll!

Well, mxtoolbox has a free validator for these DKIM records which is pretty useful. Go to Supertool, then click the dropdown and select DKIM. A DKIM record involves a domain and a selector. Here’s a real live example for Hurricane Electric, which uses he.net as its sending mail domain. In their DNS the DKIM TXT record looks like this when viewed from dig:

"v=DKIM1; k=rsa; p=MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAonNI5HmoWfZntOsU5G3t eKi70HHBhDMe7himvGBNfq119soydCj7KoR9DsFYAMqCcPghLY29ishIbzMKsCFy 68XN4MWOSrFr+ERDHIuLXcFvaYYQ0oI5HVcViKSX85/YLXe+5JUcf5VsKoBLifNy U1NFA3UPa6MHBIOcD+JVF6F67G9m7t+COhsrhcvl9x" "kNq2NAY0OxbBM+CM+V4p0J 6pgt0PqYGnwd9s3/P7TUD2jY9elJLB5CfIec4DDCROj3MgUyTl2JfBcNy0WGzkEl OpFipd5MMesZvgyIVBsgLY58hTPldYhekkKWlOhpMpYbAi8gnvk+aJv2jZcaYHpJ kLNrri+q2gMeEX30JSoXfYNKx+B6m1Udn7Ig2ngHNVTXgNZlCw6SvbfmwXBE97q5 iG1SOnrgLKQvtgZv08Y7k5sp9+2SfoOS5MSYt" "OTfCbtknUi/VbaU4kVE76jFB0xx 6CAoR1SC9lDJBGvyFMuGvyhOXTiYV44tk1fyrV9Ba4yaKi8dhgHwe9vVbCSK8Ebt CeMXrkS/I3Dc33B6+tM1poC06GVhxElpd8rHiWvNImBuqCWwtGDsXm4ulubTcjvS gglJrB7kl4l3+AcTZn15zCrePl6xHWtL29b9vEy1w7whgExoDHaXZl+Svne9pfZ7 esXNu+mfERmGb56OreCEQQMCAwEAAQ=="

This is the value of the TXT record named henet-20240223-153551._domainkey.he.net
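
dig prints a long TXT record as several quoted strings, and they are meant to be read as one value. Here is a minimal Python sketch of joining the chunks and splitting out the DKIM tags. The function name and the shortened sample are made up for illustration (it is not the real he.net key), and dropping whitespace inside p= assumes that whitespace is only folding.

```python
import re

def parse_dkim_txt(dig_answer: str) -> dict:
    """Join the quoted chunks of a TXT answer and split it into DKIM tags."""
    # The quoted strings are concatenated with nothing in between.
    chunks = re.findall(r'"([^"]*)"', dig_answer)
    joined = "".join(chunks)
    tags = {}
    for part in joined.split(";"):
        if "=" in part:
            # Split on the first "=" only, since base64 may end in "=="
            name, value = part.split("=", 1)
            tags[name.strip()] = value.strip()
    # Assumption: whitespace inside the base64 key is folding, so drop it.
    if "p" in tags:
        tags["p"] = "".join(tags["p"].split())
    return tags

# Shortened, made-up sample in the same shape as the dig output above
sample = '"v=DKIM1; k=rsa; p=MIIB IjAN" "Bgkq hkiG=="'
print(parse_dkim_txt(sample))
```

If the v, k and p tags come out clean here, the record is at least structurally sane before you hand it to a validator.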

To validate this DKIM record in mxtoolbox we pull out the token in front of _domainkey and refer to it as the selector, and drop the _domainkey and enter it like this:

The problem with the DKIM entry I was assigned to rescue was that the DKIM syntax check was not passing. Yet it looked just like what the mail team had requested. What is going on? How can this problem be broken down into smaller steps?

To be continued…

Appendix A
How did I know the exact selector for Hurricane Electric?

I looked at the SMTP headers of an email I received from them. I found this section:

DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=he.net;
	s=henet-20240223-153551

d must stand for domain and s for selector. This is all considered public information, albeit somewhat obscure. So the domain is he.net and the selector is henet-20240223-153551.
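
As a small sketch of that lookup, the following Python pulls d= and s= out of a DKIM-Signature header value and builds the DNS name to query. The function name is hypothetical; the header text is the one quoted above.

```python
def dkim_record_name(header_value: str) -> str:
    """Build selector._domainkey.domain from a DKIM-Signature header value."""
    tags = {}
    for part in header_value.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            # strip() also removes the line folding (newline and tab)
            tags[name.strip()] = value.strip()
    return f"{tags['s']}._domainkey.{tags['d']}"

header = "v=1; a=rsa-sha256; c=relaxed/relaxed; d=he.net;\n\ts=henet-20240223-153551"
print(dkim_record_name(header))
# -> henet-20240223-153551._domainkey.he.net
```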

Categories
Firewall Network Technologies

The IT Detective Agency: The case of the unreliable WiFi call

Intro

It’s been a while since I have added a case to the canon of IT detective stories which I have personally solved. It’s not that things don’t need resolving. They do! But either they look like what has come before, so there’s nothing new, or they are so new I’m still in the middle of them and you never know if they will ever be solved… Such was the situation with today’s subject: WiFi calling.

WiFi calling, which most people are blissfully ignorant of, can be very necessary if you are in a large building which shields you from cell phone tower signals and does not have any in-building signal boosters. In this situation, as long as you’ve enabled WiFi calling on your phone, it will be smart enough upon seeing no cell signal, to switch to using WiFi, assuming an access point and WiFi is reachable.

Well, such is the case at some office building my company has. And WiFi calling was found to be OK for phones using T-Mobile. But not for Verizon. With Verizon (VZ) phones WiFi calling was at best unpredictable: sometimes the call would go through and sometimes not.

Unfortunately there were a lot of parties involved in the communication path. WLCs (wireless LAN controllers) have access points (APs) connect to them. They in turn tunnel the communication to another site where the anchor controller resides. Then it gets handed off to a perimeter firewall for NATing and egress via Internet routers. The Internet routers have some sort of load-balancing in place. We don’t run them any more the way we used to. A vendor does that now. And firewalls are handled by a different group. And a different group is in charge of mobile devices. The phone also has a Global Protect client (GPC) and hence an always-on VPN connection. That part is run by yet another group! So you see how this gets impossibly messy. However, I realized I was in a pretty good place – probably the best place compared to anyone else – to do this troubleshooting, because I touched many of the groups or had “good friends” there.

What does failure look like?

On my phone, a failed attempt looks like this. I place a call, and it doesn’t go through. It also doesn’t not go through. I just never hear anything. I wait for up to a minute, because, who is going to wait more than a minute to hear something after they’ve dialed the number?

More details

At the site they convinced themselves that whereas one SSID works, a second SSID which actually uses the same path, does not. For my part I wasn’t so sure. Eventually under my fairly extensive testing I could produce the problem every time by rebooting my phone and then placing a WiFi call very quickly afterwards.

Fun aside: how to force WiFi calling even when you have signal

On an Android device go to airplane mode. Your WiFi is then disabled. But you can re-enable your WiFi and airplane mode will stay on! Now when you bring up the built-in voice calling app, you will see the green phone icon with a WiFi icon superimposed over it. That’s how you know you are placing a WiFi call.

But then if I did nothing for about 30 minutes, often my next attempted WiFi call would go through! Go figure. And the call after that would work as well, etc. But maybe a couple hours later the whole thing would break again. I don’t think they were that systematic in their testing.

Verizon to the rescue

After spinning our wheels helplessly we finally got a call with a tech engineer from Verizon who was helpful. Because at some point you think to yourself, the app developer of the phone should be able to instrument the voice app with verbose logging to say what it thinks the problem is. Let’s switch to the firewall where I have good access to the logs as well as a good colleague willing to grind it out with me. Well this is a Checkpoint firewall and the logs are filled with drops. Checkpoint logging says First packet isn’t SYN. So what the VZ guy said which helped us focus is that you want to look for the tunnels to 14.20.0.0/16 or something like that. Maybe it’s more like 14.20.128.0/17, or something that rhymes with that! In any case, we didn’t believe the First packet isn’t SYN drops were hurting us too much as we get those a lot, yet things just work.

Then there were dns requests to 8.8.8.8. Why? That’s not the dns server we configured in dhcp (another one of my sub-specialties). And even if the right dns server was being used, it was always possible it was hitting a dns firewall rule. So that had to be ruled out. And it did seem dns did not play into this. Then there was the worrisome matter of the vpn tunnel created by GPC. What if, somehow, these packets were going over that tunnel? They shouldn’t, but what if they do? Well, then we should see that traffic in the GPC logs (another of my sub-specialties). We didn’t. So I became somewhat comfortable ruling out GPC.

So back to VZ. The guy said on our test call that he saw the tunnel initially established, then there was no more communication over it. And so the tester did not receive the test call for him. So when we looked for destination 141.207…, yeah we could see IKE and IPSEC communication. We could see a tunnel being established over udp port 500, then further communication to that same destination over udp port 4500. These are pretty much the standard ports for IKE. The VZ guy said he did not have access to be able to do a trace on the IKE peer. We could do a packet trace on our firewall however.
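
To cut through the noise in a log export, the idea is to keep only UDP traffic to ports 500 and 4500 headed for the carrier’s range. A minimal sketch in Python follows. The prefix is a documentation placeholder, not Verizon’s real range, and the log records are hypothetical tuples rather than actual Checkpoint log fields.

```python
import ipaddress

# Placeholder prefix from the documentation range, NOT the carrier's real one.
CARRIER_PREFIX = ipaddress.ip_network("198.51.100.0/24")
IKE_PORTS = {500, 4500}  # IKE and IKE NAT traversal, both UDP

def is_wifi_calling_ike(dst_ip: str, proto: str, dst_port: int) -> bool:
    """True for UDP 500/4500 traffic destined for the carrier prefix."""
    return (
        proto.lower() == "udp"
        and dst_port in IKE_PORTS
        and ipaddress.ip_address(dst_ip) in CARRIER_PREFIX
    )

# Hypothetical records: (destination IP, protocol, destination port)
records = [
    ("198.51.100.7", "udp", 500),
    ("198.51.100.7", "udp", 4500),
    ("8.8.8.8", "udp", 53),
    ("198.51.100.7", "tcp", 443),
]
for rec in records:
    if is_wifi_calling_ike(*rec):
        print(rec)
```

Only the first two records print, which is the slice of traffic worth lining up against the call times.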

More testing

So we never did see an official drop in the Checkpoint logs. Still, I began to suspect the firewall, and my colleague agreed with me, or at least agreed to try some things. But first, another red herring. The VZ guy suggested we could trace the packets on the phone with pcapdroid or something like that. So I got that running on my phone. But to work it creates its own IKE tunnel, uses completely different IP addressing, and just generally makes it impossible to account for these IKE packets going to VZ.

On Checkpoint you have a general setting for how it will handle “NAT traversal” for IKE connections. It looks like this:

By the way, tracing on the firewall isn’t all that easy since there are two interfaces. We actually were running tcpdump on the inward-facing interface while running fw monitor on the outbound interface! That’s not so easy to coordinate. Neither D nor I had ever done it before. We never did reach that Aha moment where you say, look, the packet destined for the tunnel enters here, and doesn’t go out here. There was just too much competing traffic. But anyway, D wanted to play with the NAT traversal settings, which seemed easier.

First adjustment: aggressive aging

The first thing D did was to turn off aggressive aging. Well, that helped a lot. With that, I was able to place my WiFi calls successfully every time after a reboot!

But this thing is tricky. We were chatting. Some time had passed. I placed another test call. Nope. That one didn’t go through! Drat. We had more homework to do. I had been recording the exact times of the calls pretty carefully. About 16 minutes had elapsed between the two calls.

To be continued…

Conclusion

In one of our most difficult cases, we got WiFi calling working reliably on Verizon phones. There were a lot of parties involved and a lot of false leads: look for asymmetric routing, etc. The real problem was the IKE NAT traversal settings on a Checkpoint firewall. Everyone involved is much happier now.

Case: closed!

References and related

A cogent discussion of the many others having trouble with this is found at this VZ community page: https://community.verizon.com/t5/Other-Network-Discussions/What-are-the-wifi-calling-firewall-ports-and-destination-IP/m-p/1080659

Categories
Admin Network Technologies

Ping sweep for network security engineers

Intro

I swear my bash programming skills are getting worse and worse. What I really need is a bash scripting tips blog entry to remind myself of my favorite bash scripting tips. I have this for python and I refer to it and add to it all the time. I don’t care if anyone else ever uses it, it’s worth having all my used tips in one place as I find I constantly forget the basics due to infrequent usage.

Oh. So to the point. What this blog post is nominally about is to provide a usable medium-quality ping sweep that a network security engineer would find useful.

Conditions
  • access to host on the subnet in question
  • this accessible host has a bash shell CLI, e.g., a Checkpoint firewall
  • ping and arp programs available
What it does

This script is designed to sweep through a /24 subnet, politely pausing one second per attempt. It sends a single ping to each IP. This is the thing that makes it appealing to network security engineers: it does not require a reply, which is a common situation for network security appliances. It immediately checks the arp table afterwards to see if there is an arp entry (before that has a chance to age out). If so, it reports the IP as up.

The code

I call the program sweep.sh.

#!/bin/bash
# Sweep a /24: ping each address once, then check the ARP table.

is_alive_ping()
{
  # One ping with a one-second wait; no reply is needed, only the ARP side effect
  ping -c 1 -W 1 "$1" > /dev/null 2>&1
# arp -an output looks like: ? (10.29.129.208) at 01:c0:ed:78:b3:dc [ether] on eth0
# or if not present, like ? (10.29.129.209) at <incomplete> on eth0
  # -F matches the address literally, including both parentheses
  if arp -an | grep -iv incomplete | grep -qF "($1)"; then
    echo "Node with IP: $1 is up."
  fi
}

if [[ -z $1 ]]
then
  echo "No subnet passed. Pass three octets like 10.29.129"
  exit 1
fi
subnet=$1
for i in "${subnet}".{1..254}
do
  is_alive_ping "$i"
  sleep 1
done

Apologies for the lousy programming. But it gets the job done.

./sweep.sh 10.29.129
Node with IP: 10.29.129.1 is up.
Node with IP: 10.29.129.2 is up.
Node with IP: 10.29.129.3 is up.
Node with IP: 10.29.129.5 is up.
Node with IP: 10.29.129.6 is up.
Node with IP: 10.29.129.10 is up.
Node with IP: 10.29.129.50 is up.
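
The heart of the check is reading one line of arp -an output. Here is a minimal Python sketch of just that decision, using the two sample lines from the script comments. The function name is made up, and it assumes the Linux arp -an line format shown there.

```python
import re

def arp_entry_complete(arp_output: str, ip: str) -> bool:
    """True if arp -an output holds a resolved (not incomplete) entry for ip."""
    for line in arp_output.splitlines():
        match = re.search(r"\(([0-9.]+)\) at (\S+)", line)
        # Compare the whole address so 10.29.129.20 does not match .208
        if match and match.group(1) == ip:
            return match.group(2) != "<incomplete>"
    return False

sample = """? (10.29.129.208) at 01:c0:ed:78:b3:dc [ether] on eth0
? (10.29.129.209) at <incomplete> on eth0"""
print(arp_entry_complete(sample, "10.29.129.208"))  # True
print(arp_entry_complete(sample, "10.29.129.209"))  # False
print(arp_entry_complete(sample, "10.29.129.20"))   # False
```
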
Conclusion

As a network security engineer you may be asked if it’s safe to use a particular IP on one of your subnets where you have your equipment plus equipment from other groups. I provide a ping sweep script which reports which IPs are taken, not relying on an ICMP REPLY, but just on the ARP table entry which gets created if a device is on the network.

References and related

None so far!

Categories
Network Technologies Raspberry Pi

Trying to improve my home WiFi with a range extender

Intro

My Teams meetings in the mornings had poor audio quality and sometimes I could not share my screen. My suspicions focused on my home WiFi Router, which is many years old. I decided to make an experiment and get a range extender. The results are, well, mixed at best.

Windows command

netsh wlan show interface

There is 1 interface on the system:
Name : Wi-Fi 
Description : Intel(R) Dual Band Wireless-AC 3168 
GUID : f1c094c0-fcb7-4e47-86ba-51df737e58c8 
Physical address : 28:c6:3f:8f:3a:27 
State : connected 
SSID : DrJohn 
BSSID : ec:c3:02:eb:2d:7c 
Network type : Infrastructure 
Radio type : 802.11ac 
Authentication : WPA2-Personal 
Cipher : CCMP 
Connection mode : Auto Connect 
Channel : 153 
Receive rate (Mbps) : 292.5 
Transmit rate (Mbps) : 292.5 
Signal : 99% 
Profile : DrJohn

802.11ac is WiFi 5. 802.11n is WiFi 4, to be clear about it.
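
If you want to log this over time rather than eyeball it, here is a rough Python sketch that parses the netsh text and maps the radio type to a WiFi generation. The function name and the trimmed sample are for illustration only; the mapping covers just the three common radio types.

```python
WIFI_GENERATION = {
    "802.11n": "WiFi 4",
    "802.11ac": "WiFi 5",
    "802.11ax": "WiFi 6",
}

def parse_netsh(output: str) -> dict:
    """Turn 'Key : value' lines from netsh wlan show interface into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first " : " so MAC addresses keep their colons
        if " : " in line:
            key, value = line.split(" : ", 1)
            fields[key.strip()] = value.strip()
    return fields

# Trimmed sample taken from the output above
sample = """Name : Wi-Fi
Physical address : 28:c6:3f:8f:3a:27
Radio type : 802.11ac
Channel : 153
Signal : 99%
"""
info = parse_netsh(sample)
print(info["Radio type"], "=", WIFI_GENERATION.get(info["Radio type"], "unknown"))
# -> 802.11ac = WiFi 5
```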

What’s going on

My work laptop starts out using WiFi 5 (802.11ac). The signal is around 60% or so. So I guess not super great. Then after an hour or so it switches to WiFi 4 (802.11n)! Audio in my meetings gets disturbed during this time.

My WiFi Extender did not really change this behavior to my surprise! But maybe the quality is better.

One morning I started out on WiFi 4. The signal quality varied from 94% down to 61%, all while nothing was being moved, and within a matter of minutes! The lower Signal values are associated with slower transmit and receive rates, naturally. But at least with the extender WiFi 4 seems OK. It’s usable for my interactive meetings. In my experience, once you are on WiFi 4 you are very unlikely to automagically get switched back to WiFi 5. But the reverse is not true. So there’s a lot of variability in the signal over the course of minutes. But I stayed on WiFi 4 for over three hours without its changing. I connected to a different SSID, then connected back to my _EXT SSID and, bam, WiFi 5, but only at 52% signal strength.

The way I know this behavior in detail is that I happen to have a ThousandEyes endpoint agent installed and I have access to this history of the connection quality, signal strength, throughput, etc. ThousandEyes is pretty cool.

Further experimentation

The last couple days I’ve been getting WiFi 5 and it’s been sticking. What’s the difference? This sounds incredibly banal, but I stood the darn extender upright! That’s right, during those days when I was mostly getting WiFi 4 the Extender had all its antennae sticking out, but it was flat on a table. I am in a room across the hallway. Then I managed to stand it upright – a little tricky since it is plugged into an extension cord. I’m still across the hallway. But things have been behaving better ever since.

Does a WiFi extender create a new SSID?

Yes! It creates an SSID named after your SSID with an _EXT appended to that name. However, it is very important to note that it is a bridged network so it means your _EXT-connected devices see all your devices not on _EXT, and that makes it very convenient. The subnet used is your primary router’s subnet, in other words.

This TP-Link (see references) seems to have lots of nice features. MIMO, AP mode, mesh mode, etc. You may or may not need them right away. For instance, the device has several status LEDs which get kind of bright for a bedroom at nighttime. Originally we covered it with a dark T-Shirt. Then I looked at it and saw it has an LED switch! That’s right. Just press that LED switch and those way-too-bright LEDs stop illuminating, while the device keeps on working. A very small but thoughtful feature which you would never even think to look for but turns out to be important. It might have overheated had we kept it covered with that T-Shirt.

Raspberry Pi

A good command is:

sudo iwconfig wlan0

wlan0 IEEE 802.11 ESSID:"Music_EXT"
Mode:Managed Frequency:5.765 GHz Access Point: 9C:53:22:02:6B:59
Bit Rate=433.3 Mb/s Tx-Power=31 dBm
Retry short limit:7 RTS thr:off Fragment thr:off
Encryption key:off
Power Management:on
Link Quality=62/70 Signal level=-48 dBm
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
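
The Link Quality figure is a fraction, so to compare it with the percentage Windows reports it helps to convert it. A small Python sketch follows, assuming the iwconfig line format shown above; the function name is made up.

```python
import re

def link_quality(iwconfig_output: str):
    """Return (quality percent, signal dBm) from iwconfig text, or None."""
    match = re.search(
        r"Link Quality=(\d+)/(\d+)\s+Signal level=(-?\d+) dBm", iwconfig_output
    )
    if not match:
        return None
    got, best, dbm = (int(g) for g in match.groups())
    return round(100 * got / best), dbm

sample = "Link Quality=62/70 Signal level=-48 dBm"
print(link_quality(sample))
# -> (89, -48)
```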

To be continued…

References and related

TPLink AC1900 WiFi Range Extender at Amazon (Costs about $69. I do not get promotional credits!)

Categories
Firewall Linux Network Technologies

The IT Detective Agency: the case of the mysterious ICMP host administratively prohibited packets

Intro

I haven’t published a new case in a while, not for lack of cases, but more that they all fall into something I’ve already written about. But today there is definitely something new.

Some details

Thousandeyes agent-to-agent communication was generally working for all our enterprise agents after fixing firewall rules, etc, except for this one agent hosted in Azure US East. Was it something funny about the firewalls on either side of the vpn tunnel to this cloud? Ping tests were working. But a connection to tcp port 49153, which is used for agent-to-agent communication gave a response in the form of an ICMP type 3 code 10 packet which said something like host administratively prohibited. What?

The Cisco TAM suggested to look at iptables. I did a listing with iptables -L. The output is pretty long and I’m not experienced looking at it. Nothing much jumped out at me, but I did note the presence of this line:

REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

in a couple of the chains, which seemed suspicious.

An Internet search pointed towards firewalld since the agent is a Redhat 7.9 system. Indeed firewalld was running:

systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2023-10-12 15:26:25 UTC; 5h 45min ago

The suggestion is to test with firewalld disabled. Indeed this produced correct results – no more ICMP packets back.

But it’s probably a good security measure to run firewalld, so how to modify it? This note from Redhat was particularly helpful in learning how to add a rule to the firewall. I pretty much just needed to do this to permanently add my rule:

firewall-cmd --add-port 49153/tcp --permanent

Afterwards the agent-to-agent tests began to be run successfully.
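
For a quick check from another host without waiting on the agents, a tiny TCP probe does the job. This is a minimal Python sketch, not anything ThousandEyes provides; the function name is hypothetical. The assumption is that an ICMP "administratively prohibited" answer surfaces as a generic OSError rather than a refusal or a timeout.

```python
import socket

def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as open, refused, timeout or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Assumption: an ICMP prohibited reply ends up in this branch
        return f"error: {exc.strerror or exc}"

# Local demonstration against a throwaway listener on the loopback interface
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(probe_tcp("127.0.0.1", port))  # open
listener.close()
```

Against the real agent you would call probe_tcp with its address and port 49153, before and after the firewalld change.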

Which runs first, tcpdump or firewalld?

tcpdump

This is a good question to ask because if the order had been reversed, your packets might be dropped before you ever saw them in tcpdump. But tcpdump seems to get a pretty clean mirror of what the network interface gets before application or kernel processing.

The new equivalent to netstat -an

If I want to see the listening processes in Redhat I might do a

ss -ln

In the old days I memorized using netstat -an, but that is now frowned upon.

Conclusion

We solved a case where tcp packets were getting returned with an ICMP packet which basically said: prohibited. This was due to the host, a Redhat 7 system, having restricted ports due to firewalld running. Once firewalld was modified this traffic was permitted and Thousandeyes Tests ran successfully. We also proved that tcpdump runs before firewalld.

References and related

How to add rule to firewalld on Redhat-like systems.

Categories
Network Technologies Python

Python network diagram generator

Intro

Since they took away our Visio license to save licensing fees, some of us have wondered where to turn to. I once used the venerable old MS Paint after learning one of my colleagues used it. Some have turned to Powerpoint. Since I had some time and some previous familiarity with the components – for instance when I create CAD designs for 3D printing I am basically also doing CAD as code using openSCAD – I wondered if I could generate my network diagram using code? It turns out I can, at least the basic stuff I was looking to do.

Pillow

I’m sure there are much better libraries out there but I picked something that was very common although also very limited for my purposes. That is the python Pillow package. I created a few auxiliary functions to ease my life by factoring out common calls. I call the auxiliary modules aux_modules.py. Here they are.

from PIL import Image, ImageDraw, ImageFont
serverWidth = 100
serverHeight = 40
small = 5
fnt = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf', 12)
fntBold = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 11)

def drawServer(img_draw,xCorner,yCorner,text,color='white'):
# known good colors for visibility of text: lightgreen, lightblue, tomato, pink and of course white
# draw the server
    img_draw.rectangle((xCorner,yCorner,xCorner+serverWidth,yCorner+serverHeight), outline='black', fill=color)
    img_draw.text((xCorner+small,yCorner+small),text,font=fntBold,fill='black')

def drawServerPipe(img_draw,xCorner,yCorner,len,source,color='black'):
# draw the connecting line for this server. We permit len to be negative!
# known good colors if added text is in same color as pipe: orange, purple, gold, green and of course black
    lenAbs = abs(len)
    xhalf = xCorner + int(serverWidth/2)
    if source == 'top':
        coords = [(xhalf,yCorner),(xhalf,yCorner-lenAbs)]
    if source == 'bottom':
        coords = [(xhalf,yCorner+serverHeight),(xhalf,yCorner+serverHeight+lenAbs)]
    img_draw.line(coords,color,2)

def drawArrow(img_draw,xStart,yStart,len,direction,color='black'):
# draw using several lines
    if direction == 'down':
        x2,y2 = xStart,yStart+len
        x3,y3 = xStart-small,y2-small
        x4,y4 = x2,y2
        x5,y5 = xStart+small,y3
        x6,y6 = x2,y2
        coords = [(xStart,yStart),(x2,y2),(x3,y3),(x4,y4),(x5,y5),(x6,y6)]
    if direction == 'right':
        x2,y2 = xStart+len,yStart
        x3,y3 = x2-small,y2-small
        x4,y4 = x2,y2
        x5,y5 = x3,yStart+small
        x6,y6 = x2,y2
        coords = [(xStart,yStart),(x2,y2),(x3,y3),(x4,y4),(x5,y5),(x6,y6)]
    img_draw.line(coords,color,2)

def drawText(img_draw,x,y,text,fnt,placement,color):
# draw appropriately spaced text
    xy = (x,y)
    bb = img_draw.textbbox(xy, text, font=fnt, anchor=None, spacing=4, align='left', direction=None, features=None, language=None, stroke_width=0, embedded_color=False)
# honestly, the y results from the bounding box are terrible, or maybe I don't understand how to use it
    if placement == 'lowerRight':
        x1,y1 = (bb[0]+small,bb[1])
    if placement == 'upperRight':
        x1,y1 = (bb[0]+small,bb[1]-(bb[3]-bb[1])-2*small)
    if placement == 'upperLeft':
        x1,y1 = (bb[0]-(bb[2]-bb[0])-small,bb[1]-(bb[3]-bb[1])-2*small)
    if placement == 'lowerLeft':
        x1,y1 = (bb[0]-(bb[2]-bb[0])-small,bb[1])
    xy = (x1,y1)
    img_draw.text(xy,text,font=fnt,fill=color)

How to use

I can’t exactly show my example due to proprietary elements. So I can just mention I write a main program making lots of calls to these auxiliary functions.

Tip

Don’t forget that in this environment, the x axis behaves like you learned in geometry class with positive x values to the right of the y axis, but the y axis is inverted! So positive y values are below the x axis. That’s just how it is in a lot of these programs. Get used to it.
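
If you would rather think in geometry-class coordinates, one option is a tiny conversion helper. This is a hypothetical function, not part of aux_modules.py, and it treats coordinates as positions measured from the bottom edge of the canvas.

```python
def to_image_y(y_math: int, canvas_height: int) -> int:
    """Convert a y value measured up from the bottom edge to image coordinates."""
    return canvas_height - y_math

canvas_height = 400
print(to_image_y(0, canvas_height))    # 400, the bottom edge
print(to_image_y(400, canvas_height))  # 0, the top edge
print(to_image_y(100, canvas_height))  # 300
```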

What I am lacking is a good idea to do element groupings, or an obvious way to do transformations or rotations. So I just have to keep track of where I am, basically. But even still I enjoy creating a network diagram this way because there is so much control. And boy was it easy to replicate a diagram for another one which had a similar layout.

It only required the Pillow package. I am able to develop my diagrams on my local PC in my WSL environment. It’s nice and fast as well.

Example Output

This is an example output from this diagram as code approach which I produced over the last couple days, sufficiently blurred for sharing.

Network diagram (blurred) resulting from use of this code-first approach
Conclusion

I provide my auxiliary functions which permit creating “network diagrams as code.” The results are not pretty, but networking people will understand them.

References and related

I developed a way to blur images using the Python Pillow package.

CAD as code: openSCAD is what I had in mind in taking this code first approach to building up geometries.

My disorganized cheat sheet of python language features I most commonly use.

Categories
Admin Linux Network Technologies Web Site Technologies

The IT Detective Agency: This site can’t be reached

Intro

It’s been a while since I’ve had the opportunity to relate an IT mystery. After a while they are repeats of what’s already happened in the past, or it’s too complex to relate, or I was only peripherally involved. But today I came across a good one. It falls into the never been seen before category.

The details

A web server behind my web application firewall became unreachable. In the browser they get a message This site can’t be reached. The app owners came to me looking for input. I checked the WAF and it was fine. The virtual server was looking healthy. So I took a packet trace, something to this effect:

$ tcpdump -nni 0.0 host 192.168.2.124

14:00:45.180349 IP 192.68.1.13.42045 > 192.68.2.124.443: Flags [S], seq 1106553901, win 23360, options [mss 1460,sackOK,TS val 3715803515 ecr 0], length 0 out slot1/tmm3 lis=/Common/was90extqa.drjohn.com.app/was90extqa.drjohn.com_vs port=0.53 trunk=
14:00:45.181081 IP 192.68.2.124 > 192.68.1.13: ICMP host 192.68.2.124 unreachable - admin prohibited filter, length 64 in slot1/tmm2 lis= port=0.47 trunk=
14:00:45.181239 IP 192.68.1.13.42045 > 192.68.2.124.443: Flags [R.], seq 1106553902, ack 0, win 0, length 0 out slot1/tmm3 lis=/Common/was90extqa.drjohn.com.app/was90extqa.drjohn.com port=0.53 trunk=

I’ve never seen that before, ICMP host 192.68.2.124 unreachable – admin prohibited filter. But I know ICMP can be used to relay out-of-band routing information on occasion, though I do not see it often. I suspect it is a BAD THING and forces the connection to be shut down. Question is, where was it coming from?

The communication is via a firewall so I check the firewall. I see a little more traffic so I narrow the filter down:

$ tcpdump -nni 0.0 host 192.168.2.124 and port 443

And then I only see the initial SYN packet followed by the RST – from the same source IP! So since I didn’t see the bad ICMP packet on the firewall, but I do see it on the WAF, I preliminarily conclude the problem exists on the WAF.

Rookie mistake! Did you fall for it? So very, very often, in the heat of debugging, we invent some unit test which we’ve never done before, and we have to be satisfied with the uncertainty in the testing method and hope to find a control test somehow, somewhere to validate our new unit test.

Although I very commonly do compound filters, in this case it makes no sense, as I realized a few minutes later. My port 443 filter would of course exclude logging the bad ICMP packets because ICMP does not use tcp port 443! So I took that out and re-ran it. Yup. Bad ICMP packet still present on the firewall, even on the interface of the firewall directly connected to the server.
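
The mistake is easy to see in a toy model. In this Python sketch the three packets from the trace are reduced to hypothetical (protocol, port) pairs and run through both filters. It models only the filter logic and has nothing to do with how tcpdump is implemented.

```python
# Simplified, hypothetical packets: (protocol, port)
packets = [
    ("tcp", 443),    # SYN to the server
    ("icmp", None),  # host unreachable, admin prohibited
    ("tcp", 443),    # RST
]

def host_and_port_443(pkt) -> bool:
    """Model of 'host X and port 443': only TCP or UDP packets carry a port."""
    proto, port = pkt
    return proto in ("tcp", "udp") and port == 443

def host_only(pkt) -> bool:
    """Model of 'host X': every packet to or from the host matches."""
    return True

print([p[0] for p in packets if host_and_port_443(p)])  # ['tcp', 'tcp']
print([p[0] for p in packets if host_only(p)])          # ['tcp', 'icmp', 'tcp']
```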

So at this point I have proven to my satisfaction that this packet, which is ruining the communication, really comes from the server.

What the server guys say

Server support is outsourced. The vendor replies

As far as the patching activities go , there is nothing changed to the server except distro upgrading from 15.2 to 15.3. no other configs were changed. This is a regular procedure executed on almost all 15.2 servers in your environment. No other complains received so far…

So, the usual It’s not us, look somewhere else. So the app owner asks me for further guidance. I find it’s helpful to create a test that will convince the other party of the error with their service. And what is one test I would have liked to have seen but didn’t conduct? A packet trace on the server itself. So I write

I would suggest they (or you) do a packet trace on the server itself to prove to themselves that this server is not behaving in an acceptable way, network-wise, if they see that same ICMP packet which I see.

The resolution

This kind of thing can often come to a stand-off, or many days can be wasted as an issue gets escalated to sufficiently competent technicians. In this case it wasn’t so bad. A few hours later the app owners write and mention that the home-grown local firewall seemed suspect to them. They disabled it and this traffic began to work.

They are reaching out to the vendor to understand what may have happened.

Case: closed!

Conclusion

An IT mystery was resolved today – something we’ve never seen but were able to diagnose and overcome. We learned it’s sometimes a good thing to throw a wider net when seeing unexpected reset packets because maybe just maybe there is an ICMP host unreachable packet somewhere in the mix.

Most firewalls would just drop packets and you wait for a timeout. But this was a homegrown firewall running on SLES 15. So it abides by its own ways of working, I guess. So because of the RST, your connection closes quickly, not timing out as with a normal network firewall.

As always, one has to maintain an open mind as to the true source of an issue. What was working yesterday does not today. No one admits to changing anything. Finding clever ad hoc unit tests is the way forward, and don’t forget to validate the ad hoc test. We use curl a lot for these kinds of tests. A browser is a complex beast and too much of a black box.

Categories
Network Technologies

How to force snmpwalk to convert strings to numeric OIDs

Intro

It’s a little hard to find this information on the Internet, so I’m amplifying the correct answer here by using my blog.

The details

I’m not super-competent with MIBs and such, but I manage for my purposes with my basic understanding. I have access to an F5 bigip with various IPSEC tunnels on it. I want to use Zabbix to check the status of those tunnels. So I do an snmpwalk like this:

snmpwalk -v3 … -c public 127.0.0.1 F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTunnelState

which produces output like this line:

F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTunnelState."/Common/tunnel-01".58401 = STRING: up

But I cannot take that as it is and use it in an snmpget like this:

snmpget -v3 … -c public 127.0.0.1 F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTunnelState."/Common/tunnel-01".58401

That produces an error like this:

Unknown Object Identifier (Index out of range: /Common/tunnel-01 (sysIpsecSpdStatTrafficSelectorName))

So we need to convert the string into a numeric OID. But how?

The answer

Use the -On switch as an additional argument in your snmpwalk.

You will get a scary long OID, but it will at least be numeric.
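Why is the OID so long? As far as I understand it, a string index such as /Common/tunnel-01 gets spelled out as its length followed by the ASCII code of each character. Here is a minimal Python sketch of that encoding. It assumes the index is an ordinary variable-length (non-IMPLIED) string, which I have not verified against the F5 MIB, and the function names are my own invention.

```python
def encode_string_index(text):
    """Encode a string index as dotted sub-identifiers: length, then ASCII codes.

    Assumption: the MIB defines the index as a variable-length,
    non-IMPLIED string, so the length comes first.
    """
    data = text.encode("ascii")
    return ".".join(str(n) for n in [len(data), *data])


def decode_string_index(suffix):
    """Reverse the encoding; return the string and any trailing sub-identifiers."""
    numbers = [int(part) for part in suffix.strip(".").split(".")]
    length, rest = numbers[0], numbers[1:]
    name = bytes(rest[:length]).decode("ascii")
    return name, rest[length:]


if __name__ == "__main__":
    print(encode_string_index("/Common/tunnel-01"))
```

Running decode_string_index on the digits that follow the table column in a -On walk should hand back the tunnel name plus whatever numeric index trails it, which is a handy sanity check.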

Going further

You can then deconstruct the response and reconstitute the section at the beginning with a nice name. For my F5 example

.1.3.6.1.4.1.3375.2.1.2.17.1.3.1.14

becomes

F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatBytes

I think. Then preserve the following digits as is.
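That substitution can be sketched in a few lines of Python. The lookup table holds only the one mapping from my example above (which, as I said, I am not fully sure of), and the function name is made up for illustration.

```python
# Numeric prefix -> symbolic name. Only the single mapping from the example
# above is listed here, and that mapping itself is unverified.
KNOWN_PREFIXES = {
    ".1.3.6.1.4.1.3375.2.1.2.17.1.3.1.14": "F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatBytes",
}


def symbolic_prefix(numeric_oid, known=KNOWN_PREFIXES):
    """Swap a known numeric prefix for its name; keep the trailing digits as is."""
    for prefix, name in known.items():
        if numeric_oid == prefix or numeric_oid.startswith(prefix + "."):
            return name + numeric_oid[len(prefix):]
    # No known prefix matched: return the OID untouched.
    return numeric_oid
```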

Conclusion

We have shown how to output a numeric OID from an snmpwalk. This, specifically, is useful in converting a string embedded in the output into a numeric OID, which may then be used by other SNMP applications such as Zabbix which may or may not have the MIB file loaded. The secret is simply to use the -On switch in snmpwalk.

References and related

My Zabbix FAQ – questions you wish they had answered, can be very helpful

Categories
Admin Network Technologies TCP/IP

Verizon Airspeed Hotspot uses ipv6 and interferes with VPN client Global Protect

Intro

The headline says it all. I got my shiny brand new Verizon hotspot from Walmart. I managed to activate it and add it to my Verizon account (not super easy, but after a few stumbles it did work.) I tried it out on my home PC – works fine. I tried it out on my work PC. No good. My Global Protect connection was unstable. It connects for about a minute, then disconnects, then connects, etc. Basically unusable.

The details

I have heard of a possible problem with the GP client (version 5.2.11) and IPv6. So I looked to see if this hotspot could be handing out IPv6 info. Yes. It is. But is that really making a difference? I concocted a simple test. I disabled IPv6 on my Wi-Fi adapter, then re-tested the GP client. The connection was smooth as glass! No disconnects!

Disable ipv6 on your Wi-Fi adapter

Bring up a powershell as administrator. Then:

get-netadapterbinding -componentid ms_tcpip6

will show you the current state of ipv6 on your adapters.

disable-netadapterbinding -Name "Wi-Fi" -ComponentID ms_tcpip6

will disable ipv6 on your Wi-Fi. And

enable-netadapterbinding -Name "Wi-Fi" -ComponentID ms_tcpip6

will re-enable it.

ipconfig /all output

For the record, here are some interesting bits from running ipconfig /all:

Wireless LAN adapter Wi-Fi:

Connection-specific DNS Suffix . :
Description . . . . . . . . . . . : Intel(R) Dual Band Wireless-AC 8265
Physical Address. . . . . . . . . : 0C-BD-94-98-11-5B
DHCP Enabled. . . . . . . . . . . : Yes
Autoconfiguration Enabled . . . . : Yes
Temporary IPv6 Address. . . . . . : 2600:1001:b004:2b78:8ab:145c:d014:2edd(Deprecated)
IPv6 Address. . . . . . . . . . . : 2600:1001:b004:2b78:2cc0:71b0:7f1e:a973(Deprecated)
Link-local IPv6 Address . . . . . : fe80::2cc0:71b0:7f1e:a973%30(Preferred)
IPv4 Address. . . . . . . . . . . : 192.168.1.103(Preferred)

Subnet Mask . . . . . . . . . . . : 255.255.255.0
Lease Obtained. . . . . . . . . . : Thursday, April 21, 2022 4:54:04 PM
Lease Expires . . . . . . . . . . : Friday, April 22, 2022 4:54:04 AM
Default Gateway . . . . . . . . . : 192.168.1.1
DHCP Server . . . . . . . . . . . : 192.168.1.1
DHCPv6 IAID . . . . . . . . . . . : 302832932
DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-28-89-F6-8E-B0-5C-DA-E6-09-0A
DNS Servers . . . . . . . . . . . : fe80::50ae:caff:fea8:1dbc%30
192.168.1.1
NetBIOS over Tcpip. . . . . . . . : Enabled
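If you want to check quickly whether an adapter has picked up a routable IPv6 address and not just the ever-present link-local one, a few lines of Python can sort that out from saved ipconfig /all text. This is only a sketch: the regular expression is my guess at the English-language Windows layout shown above, and the function name is hypothetical.

```python
import ipaddress
import re

# Matches "IPv6 Address", "Temporary IPv6 Address" and "Link-local IPv6 Address"
# lines as laid out in the English-language ipconfig output above (an assumption).
ADDRESS_LINE = re.compile(r"IPv6 Address[ .]*:\s*([0-9a-fA-F:]+)")


def ipv6_addresses(ipconfig_text):
    """Return (address, kind) pairs for each IPv6 address line found."""
    found = []
    for match in ADDRESS_LINE.finditer(ipconfig_text):
        # The character class stops before any %zone suffix or (Preferred) tag.
        addr = ipaddress.IPv6Address(match.group(1))
        if addr.is_link_local:
            kind = "link-local"
        elif addr.is_global:
            kind = "global"
        else:
            kind = "other"
        found.append((str(addr), kind))
    return found
```

A "global" entry in the result means the hotspot really is handing out IPv6 addressing, which is what the output above shows.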

But, having done all that, I can only occasionally connect to GP. It seems to work slightly better at night. IPv6 does not seem to be the sole hiccup. No idea what the recipe for reliable success is. If I ever learn it I will publish it. Meanwhile, my phone’s hotspot, also Verizon, also handing out IPv6 info, usually permits me to connect to GP. It’s hard to see the difference.

Conclusion

The Verizon Airspeed Hotspot sends out a mix of IPv6 and IPv4 info to DHCP clients. Palo Alto Networks’ Global Protect client does not play well with that setup and will not have a stable connection.

I do not think there is a way to disable IPv6 on the hotspot. However, for those with admin access it can be disabled on a Windows PC. And then GP will work just fine. Or not.

Oh, and by the way, otherwise the Airspeed works well and is an adequate solution where you need a good reliable hotspot. Well, in fact, don’t expect reliability like you have from a wired connection. After a couple hours, all users just got dropped for no apparent reason whatsoever.

Categories
Admin DNS Firewall Network Technologies TCP/IP

The IT detective agency: named times out tcp queries

Intro

I’ve been reliably running ISC’s BIND server for eons. Recently I had a problem getting my slave servers updated after a change to the primary master. What was going on there?

The details

This was truly a team effort. I saw that the zone file had differing serial numbers on the master versus the slave servers. My attempts to update via an rndc refresh of the zone were having no effect.
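As an aside, deciding which of two SOA serial numbers is newer is not quite a plain integer comparison, because serials are 32-bit values that are allowed to wrap around (RFC 1982 serial number arithmetic). Here is a minimal Python sketch of that comparison. The function name is mine, and the serial values in the usage note are made up for illustration.

```python
SERIAL_BITS = 32


def serial_is_newer(candidate, current):
    """True if candidate is newer than current under RFC 1982 arithmetic.

    The difference is taken modulo 2**32. A forward distance of less than
    half the number space counts as newer. A distance of exactly 2**31 is
    undefined in the RFC and is treated as not newer here.
    """
    diff = (candidate - current) % (1 << SERIAL_BITS)
    return 0 < diff < (1 << (SERIAL_BITS - 1))
```

For example, serial_is_newer(2022042101, 2022042100) is True, and so is serial_is_newer(1, 4294967295), since the serial has simply wrapped around.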

So I tried a zone transfer by hand: dig axfr drjohnstechtalk.com @50.17.188.196

That timed out!

Yet, regular DNS queries went through fine: dig ns drjohnstechtalk.com @50.17.188.196

I thought about it and remembered zone transfers use TCP whereas standard queries use UDP. So I tried a TCP-based simple query: dig +tcp ns drjohnstechtalk.com @50.17.188.196. It timed out!
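For the curious, here is roughly what a TCP-based DNS query looks like on the wire, sketched in Python with only the standard library. The one difference from UDP that matters here is that every message over TCP carries a two-byte length prefix. Treat this as a bare-bones illustration and not a replacement for dig: the function names and the fixed transaction ID are my own choices, and the reply is returned unparsed.

```python
import socket
import struct


def build_query(name, qtype=2, txid=0x1234):
    """Build a minimal DNS query message (qtype 2 = NS, class IN)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts zero.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message):
    """DNS over TCP prefixes each message with a two-byte length field."""
    return struct.pack("!H", len(message)) + message


def recv_exact(sock, count):
    """Read exactly count bytes, since TCP may deliver the reply in pieces."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full reply arrived")
        buf += chunk
    return buf


def query_tcp(server, name, timeout=2.0):
    """Send one NS query over TCP port 53 and return the raw reply bytes."""
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(frame_for_tcp(build_query(name)))
        (length,) = struct.unpack("!H", recv_exact(sock, 2))
        return recv_exact(sock, length)
```

Calling query_tcp("50.17.188.196", "drjohnstechtalk.com") against a server in the state described here would be expected to raise a timeout error instead of returning a reply.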

So of course one suspects the firewall, which is reasonable enough. And when I looked at the firewall I found some funny drops, though I couldn’t line them up exactly with my failed tests. But I’m not a firewall expert; I just muddle through.

The next day someone from the DNS group asked how local queries behaved. Hmm. Never tried that. So I tried it: dig +tcp ns drjohnstechtalk.com @localhost. That timed out as well! That was a brilliant suggestion, as we could now eliminate the firewall and all that complexity from the equation. Up to then I had been trying to do packet traces on two different machines at the same time and line up the results. It wasn’t easy.

The whole issue was very concerning to us because we feared our secondaries would be unable to update their slave zones and ultimately time them out. The result would be devastating.

We have support, fortunately. A company that hearkens from the good old days, with real subject matter experts. But they’re extremely busy. We did not get a suggestion for a couple weeks. But eventually we did. They had seen this once before.

named time to respond to TCP-based queries

The above graph is from a Zabbix monitor showing how long it takes that DNS server to respond to that simple query. 6 s is a time-out. I actually set dig to time out at 2 s, but in wall-clock time it takes 6 s, presumably because dig makes three tries by default.
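The measurement itself is simple. A minimal Python sketch of such a probe timer follows. This is not the actual Zabbix item we use, and probe stands for whatever callable performs the TCP query (a hypothetical placeholder).

```python
import time


def timed_probe(probe):
    """Run probe() and return (seconds_elapsed, succeeded).

    Timeouts and connection failures from socket code are OSError
    subclasses, so they count as a failed probe instead of crashing.
    """
    start = time.monotonic()
    try:
        probe()
        succeeded = True
    except OSError:
        succeeded = False
    return time.monotonic() - start, succeeded
```

A monitor can then graph the elapsed time and alert when succeeded is False or the time sits at the timeout ceiling.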

The fix

We removed this line from the options block of named.conf:

keep-response-order {any; };

The info from the experts is that most likely that line was configured as a workaround for CVE-2019-6477, but that issue has been fixed since 9.15.6.

Conclusion

We encountered the named daemon in a situation where it was unable to respond to TCP-based DNS queries and hence unable to do zone transfers. So although most queries use UDP, this was a serious issue for us and prevented zones from being updated on all authoritative nameservers.

As is the case with so many modern IT problems, the effect was not black or white. Failures were intermittent, and then permanent. A restart fixed this issue (forgot to mention so far!). But we involved an expert to find the root cause and it was the presence of a single configuration line in our named.conf. After removing that all was good.