Categories
Firewall Linux Network Technologies

The IT Detective Agency: the case of the mysterious ICMP host administratively prohibited packets

Intro

I haven’t published a new case in a while, not for lack of cases, but more that they they all fall into something I’ve already written about. But today there is definitely something new.

Some details

Thousandeyes agent-to-agent communication was generally working for all our enterprise agents after fixing firewall rules, etc, except for this one agent hosted in Azure US East. Was it something funny about the firewalls on either side of the vpn tunnel to this cloud? Ping tests were working. But a connection to tcp port 49153, which is used for agent-to-agent communication gave a response in the form of an ICMP type 3 code 10 packet which said something like host administratively prohibited. What?

The Cisco TAM suggested to look at iptables. I did a listing with iptables -L. The output is pretty long and I’m not experienced looking at it. Nothing much jumped out at me, but I did note the presence of this line:

REJECT     all  —  anywhere             anywhere             reject-with icmp-host-prohibited

in a couple of the chains, which seemed suspicous.

An Internet search pointed towards firewalld since the agent is a Redhat 7.9 system. Indeed firewalld was running:

systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2023-10-12 15:26:25 UTC; 5h 45min ago

The suggestion is to test with firewalld disabled. Indeed this produced correct results – no more ICMP packets back.

But it’s probably a good security measure to run firewalld, so how to modify it? This note from Redhat was particularly helpful in learning how to add a rule to the firewall. I pretty much just needed to do this to permanently add my rule:

firewall-cmd –add-port 49153/tcp –permanent

Afterwards the agent-to-agent tests began to be run successfully.

Which runs first, tcpdump or firewalld?

tcpdump

This is a good question to ask because if the order had been different, and who knows, you might have your packets dropped before you ever see them on tcpdump. But tcpdump seems to get a pretty clean mirror of what the network interface gets before application or kernel processing.

The new equivalent to netstat -an

If I want to see the listening processes in Redhat I might do a

ss -ln

In the old days I memorized using netstat -an, but that is now frowned upon.

Conclusion

We solved a case where tcp packets were getting returned with an ICMP packet which basically said: prohibited. This was due to the host, a Redhat 7 system, having restricted ports due to firewalld running. Once firewalld was modified this traffic was permitted and Thousandeyes Tests ran successfully. We also proved that tcpdump runs before firewalld.

References and related

How to add rule to firewalld on Redhat-like systems.

Categories
Admin Firewall Linux

Linux: how to estimate bandwidth usage to a particular subnet

Intro

Let’s say someone asks you to estimate the total bandwith used by a particular subnet, or a particular service such as https on port 443. I provide a crude way to do that using tcpdump on a not-too-busy server.

The code

I call it bandwidth.sh. By the way, I ran it on a Checkpoint Gaia appliance so it works there as well.

#!/bin/bash
# DrJ 11/21
sleep=120
file=/tmp/ctpackets
sum() {
sum=0
cat $file|while read line; do
 length=$(echo $line|awk '{print $17}'|sed 's/)//')
 sum=$(expr $sum + $length)
 echo $sum
done
}
while /bin/true; do
tcpdump -c1000 -v -nni eth1 net 216.71/16 > $file
#10:29:49.471455 IP (tos 0x0, ttl 126, id 32399, offset 0, flags [none], proto: UDP (17), length: 105) 10.32.25.126.3391 > 216.71.170.32.61445: UDP, length 77
total=$(sum|tail -1)
t0=$(head -1 $file|awk '{print $1}')
t1=$(tail -1 $file|awk '{print $1}')
h0=$(echo $t0|cut -d: -f1|sed 's/^0//')
h1=$(echo $t1|cut -d: -f1|sed 's/^0//')
m0=$(echo $t0|cut -d: -f2|sed 's/^0//')
m1=$(echo $t1|cut -d: -f2|sed 's/^0//')
s0=$(echo $t0|cut -d: -f3|sed 's/^0//')
s1=$(echo $t1|cut -d: -f3|sed 's/^0//')
s0=$(echo $s0|cut -d\. -f1|sed 's/^0//')
s1=$(echo $s1|cut -d\. -f1|sed 's/^0//')
[ -z "$h0" ] && h0=0
[ -z "$h1" ] && h1=0
[ -z "$m0" ] && m0=0
[ -z "$m1" ] && m1=0
[ -z "$s0" ] && s0=0
[ -z "$s1" ] && s1=0
t0secs=$((3600*$h0+60*$m0+$s0))
t1secs=$((3600*$h1+60*$m1+$s1))
#echo total bytes: $total
elapsed=$(($t1secs-$t0secs))
#echo elapsed time: $elapsed
kbps=$(($total*8/$elapsed/1000))
echo $kbps kbps at $(date)
sleep $sleep
done

The idea

Running tcpdump with the -v switch gives us packet length. We find that length and sum it up. Here we used a filter epxression of 216.71/16 to capture only the traffic from that subnet.

The number of packets to capture has to be tuned to how busy it gets. Now it’s set to only capture 1000 packets. And you see my crude timings are truncated at the second. So 1000 packets in one second or about 1.5 MBytes/sec = 12 Mbps is the maximum sensitivy of this approach. I doub it will really work for interfaces with more thn 100 Mbps, even after you scaled up the count (and don’t forget to change the denominator in the kbps line!

Here’s a sample output:

1000 packets captured
2002 packets received by filter
0 packets dropped by kernel
5 kbps at Wed Nov 3 12:09:45 EDT 2021

I think it’s important to note the number of packets dropped by the kernel. So if it gets too busy as I underatdn it, it will at least try to tell yuo that it couldn’t capture all the data and at that point you can no longer trust this method. Perhaps with enhanced statistical methods it could be salvaged.

I don’t run it continuously to also give the kernel a breather. It probably doesn’t make much difference, but every two minutes seems plenty frequent to me…

Conclusion

We have demonstrated a crude but better-than-nothing script to calculate bandwidth for a given tcpdump filter expression. It won’t win any awards, but it contains some worthwhile ideas. And it seems to work at low bandwidth levels.

Categories
Firewall

Checkpoint Gaia admin tips

Intro

Suppose, hypothetically, that you had super admin access to a CMA in SmartConsole v 80.40, but lacked ssh or GUI access to firewalls within that CMA? What could you do? Can you run commands in a pinch? Yes. You can. Here are some concrete examples.

Caveats

In the servers section of the domain you can right-click and choose “Run one-time script.” That’s great, but I think there are limits. It will time out a script that takes too long. IDK, maybe 10 seconds or so is the maximum time allowed. The returned text gets truncated if it’s too long. 15 lines of text is OK. 200 is not. Somewhere inbetween those two is the limit.

Running clish commands

clish commands can indeed be run this way. I was interested in examining a few routes on a firewall with many static routes. I ran:

netstat -rn|grep 198.23|head -15

Set a static route

clish -sc “set static-route 197.6.75.0/24 nexthop gateway address 10.23.42.10 on”

Redistribute this route via BGP

clish -sc “set route-redistribution to bgp-as 38002.48928 from static-route 197.6.75.0/24 on”

Run a PING (best to restrict the number of ping packets)

ping -c3 1.1.1.1

Show a part of configuration, e.g., BGP stuff

clish -c “show configuration”|grep bgp|head -15

Show cluster IPs

cphaprob -vs all -a -m if|grep 10.182.136

Learn the name of the switch and switch port an interface is connected to (Cisco switches only)

This is a really awesome trick. And it works. Maybe it relies on something clled CDP. Not sure. But you run it and it will tell you the hostname of the switch and the port, e.g., eth1/5.

tcpdump -vnni eth1-08 -s 1500 -c 1 ‘(ether[12:2]=0x88cc or ether[20:2]=0x2000) and not tcp and not udp and not icmp’

The interface name eth1-08 above is just an example.

This command is general-putpose and works with any device with any OS, assuming you can run a packet trace with tcpdump or equivalent. Very cool.

Conclusion

Real firewall admins I know fail to realize that even when they lack shell access to a firewall they can pretty issue any command they need if they use the one-time script option in SmartConsole. It just helps to follow along the lines of the examples above – limiting output, etc. Even clish config changes can be made! A common reason to be in this situation is to learn someone changed a password or cleaned up old accounts.

As a bonus I show how to identify the name of the switch a firewall is connected to as well as the switch port. The general-purpose command works on any OS.

Categories
Admin DNS Firewall Network Technologies TCP/IP

The IT detective agency: named times out tcp queries

Intro

I’ve been reliable running ISC’s BIND server for eons. Recently I had a problem getting my slave servers updated after a change to the primary master. What was going on there?

The details

This was truly a team effort. I saw that the zone file had differing serial numbers on the master versus the slave servers. My attempts to update via an rndc refresh zone was having no effect.

So I tried a zone transfer by hand: dig axfr drjohnstechtalk.com @50.17.188.196

That timed out!

Yet, regular dns qeuries went through fine: dig ns drjohnstechtakl.com @50.17.188.196

I thought about it and remembered zone transfers use TCP whereas standard queries use UDP. So I tried a TCP-based simple query: dig +tcp ns drjohnstechtalk.com @50.17.188.196. It timed out!

So of course one suspects the firewall, which is reasonable enough. And when I looked at the firewal I found some funny drops, though i cuoldn’t line them up exactly with my failed tests. But I’m not a firewall expert; I just muddle through.

The next day someone from the DNS group asked how local queries behaved? Hmm. never tried that. So I tried it: dig +tcp ns drjohnstechtalk.com @localhost. That timed out as well! That was a brilliant suggestion as we now could eliminate the firewall and all that complexity from the equation. Because I had tried to do packet traces on two different machines at the same time and line up the results. It wasn’t easy.

The whole issue was very concerning to us because we feared our secondaries would be unable to pudate their slave zones and ultimately time them out. The result would be devastating.

We have support, fortunately. A company that hearkens frmo the good old days, with real subject matter experts. But they’re extremely busy. We did not get a suggestion for a couple weeks. But eventually we did. They had seen this once before.

named time to respond to TCP-based queries

The above graph is from a Zabbix monitor showing how long it takes that dns server to respond to that simple query. 6 s is a time-out. I actually set dig to timeout at 2 s, but in wall-clock time it actually takes 6 s.

The fix

We removed this line from the options block of named.conf:

keep-response-order {any; };

The info fmo the experts is that most likely that was configured as a workaround to CVE-2019-6477 but that issue was fixed since 9.15.6.

Conclusion

We encountered the named daemon in a situation where it was unable to respond to TCP-based DNS queries and hence unable to do zone transfers. So although most queries use UDP, this was a serious issue for us and prevented zones from being updated on all authoritative nameservers.

As is the case with so many modern IT problems, the effect was not black or white. Failures were intermittent, and then permanent. A restart fixed ths issue (forgot to mention so far!). But we involved an expert to find the root cause and it was the presence of a single configuration line in our named.conf. After removing that all was good.

Categories
Admin Firewall Proxy

Checkpoint SYN Defender: what you don’t know can hurt you

Intro

Our EDI group hails me last Friday and says they can’t reach their VANs, or at best intermittently. What to do, what to do… I go on the offensive and say they have to stop using FTP (and that’s literal FTP, not sftp, not FTPs, just plain old FTP), it’s been out of date for at least 15 years.

But that wasn’t really helping the situation, so I had to dig a lot deeper. And frankly, I was coincidentally having intermittent issues with my scripted speedtests. Could the two be related?

The details

We have a bunch of synthetic monitors we run though that same firewall. They were failing every few minutes, and then became good.

And these FTPs were like that as well. Some would work and then minutes later not work.

The firewall person on call looked at the firewall, saw some of the described traffic passing through, and declared firewall is fine.

So I got a more cooperative firewall colleague on this. And he got a really expert Checkpoint support person on the call. That guy led us to look at SYN DEFENDER which is part of IPS and enabled via fw accel. If it sees too many out of state packets in a given time it will shut down the interface where the problem was observed!

The practical effect is that even if you’re taking traces on the Checkpoint, checking the logs, etc, you won’t see the traffic! So that really throws most firewall admins is this situation is so unusual and they are not trained to look for it.

In this case it was an internal firewall and ir was comfortable to disable SYN DEFENDER on it. All problems went away after that.

Four months later…

Then four months later, after the firewall was upgraded to v 81.10, they must have set SYN DEFENDER (AKA synatk) up all over again. And of course no one was thinking about it or expecting what happened next, which is, these exact same problems started all over again. But there were different firewall colleagues involved, none with any first-hand experience of the issue. Then I got involved and just sort of tackled my way through it in a trouble-shooting session. No one was placing any judgments (my-stuff-is- fine,-yours-must-be-broken kind of thinking). Then I eventually recalled the old problem, and looked up this post to help name it – SYN DEFENDER – so that that would be meaningful to the firewall colleague. Yup, he took it from there. And we were good. I admonished the on-call guy who totally missed it, and he humbly admitted to not being familiar with this feature and how does it work. So I will explain it to him.

Results of running fwaccel synatk config:

enabled 0
enforce 0
global_high_threshold 10000
periodic_updates 1
cookie_resolution_shift 6
min_frag_sz 80
high_threshold 5000
low_threshold 1000
score_alpha 100
monitor_log_interval (msec) 60000
grace_timeout (msec) 30000
min_time_in_active (msec) 60000

These are probably the defaults as we haven’t messed with them. Right now you see it’s disabled. It spontaneously re-enabeld itself after only a few days, and the problems started all over again.

References and related

VAN: Value Added Network

Categories
Admin Firewall

The IT Detective Agency: large packets dropped by firewall, but logs show OK

Intro
All of a sudden one day I could not access the GUI of one my security appliances. It had only worked yesterday. CLI access kind of worked – until it didn’t. It was the standby part of a cluster so I tried the active unit. Same issues. I have some ill-defined involvement with the firewall the traffic was traversing, so I tried to debug the problem without success. So I brought in a real firewall expert.

More details
Of course I knew to check the firewall logs. Well, they showed this traffic (https and ssh) to have been accepted, no problems. Hmm. I suspected some weird IPS thing. IPS is kind of a big black box to me as I don’t deal with it. But I have seen cases where it blocks traffic without logging the fact. But that concern led me to bring in the expert.

By myself I had gotten it to the point where I had done tcpdump (I had totally forgotten how to use fw monitor. Now I will know and refer to my own recent blog post) on the corporate network side as well as the protected subnet side. And I saw that packets were hitting the corporate network interface that weren’t crossing over to the protected subnet. Why? But first some more about the symptoms.

The strange behaviour of my ssh session
The web GUI just would not load the home page. But ssh was a little different. I could indeed log in. But my ssh froze every time I changed to the /var/log directory and did a detailed directory listing ls -l. The beginning of the file listing would come back, and then just hang there mid-stream, frozen. In my tcpdump I noticed that the packets that did not get through were larger than the ones sent in the beginning of the session – by a lot. 1494 data bytes or something like that. So I could kind of see that with ssh, you normally send smallish packets, until you need a bigger one for something like a detailed directory listing! And https sends a large server certificate at the beginning of the session so it makes sense that it would hang if those packets were being stopped. So the observed behaviour makes sense in light of the dropping of the large packets. But that doesn’t explain why.

I asked a colleague to try it and they got similar results.

The solution method
It had nothing to do with IPS. The firewall guy noticed and did several things.

  • He agreed the firewall logs showed my connection being accepted.
  • He saw that another firewall admin had installed policy around the time the problem began. We analyzed what was changed and concluded that was a false lead. No way those changes could have caused this problem.
  • He switched the active firewall to standby so that we used the standby unit. It worked just fine!
  • He observed that the current active unit became active around the time of the problem, due to a problem with an interface on the normally active unit.

I probably would have been fine to just work using the standby but I didn’t want to crimp his style, so he continued in investigating…and found the ultimate root cause.

And finally the solution
He noticed that on the bad firewall the one interface – I swear I am not making this up – had been configured with a non-standard MTU! 1420 instead of 1500.

Analysis
I did a head slap when he shared that finding. Of course I should have looked for that. It explains everything. The OS was dropping the packet, not the firewall blade per se. And I knew the history. Some years back these firewalls were used for testing OLTV, a tunneling technology to extend layer 2 across physically separated subnets. That never did work to my satisfaction. One of the issues we encountered was a problem with large packets. So the firewall guy at the time tried this out to help. Normally firewalls don’t fail so the one unit where this MTU setting was present just wasn’t really used, except for brief moments during OS upgrade. And, funny to say, this mis-configuration was even propagated from older hardware to newer! The firewall guys have a procedure where they suck up all the configuration from the old firewall and restore to the newer one, mapping updated interface names, etc, as needed.

Well, at least we found it before too many others complained. Though, as expected, complain they did, the next day.

Aside: where is curl?
I normally would have tested the web page from the firewall iself using curl. But curl has disappeared from Gaia v 80.20. And there’s no wget either. How can such a useful and universal utility be missing? The firewall guy looked it up and quickly found that instead of curl, they have curl_cli. Who knew?

Conclusion
The strange case of the large packets dropped by a firewall, but not by the firewall blade, was resolved the same day it occurred. It took a partner ship of two people bringing their domain-specific knowledge to bear on the problem to arrive at the solution.

Categories
Admin Firewall Network Technologies Security

Linux shell script to cut a packet trace every 10 minutes on Checkpoint firewall

Intro
Scripts are normally not worth sharing because they are so easy to construct. This one illustrates several different concepts so may be of interest to someone else besides myself:

  • packet trace utility in Checkpoint firewall Gaia
  • send Ctrl-C interrupt to a process which has been run in the background
  • giving unqieu filenames for each cut
  • general approach to tacklnig the challenge of breaking a potentially large output into manageable chunks

The script
I wanted to learn about unexpected VPN client disconnects that a user, Sandy, was experiencing. Her external IP is 99.221.205.103.

while /bin/true; do
# date +%H%M inserts the current Hour (HH) and minute (MM).
 file=/tmp/sandy`date +%H%M`.cap
# fw monitor is better than tcpdump because it looks at all interfaces
 fw monitor -o $file -l 60 -e "accept src=99.221.205.103 or dst=99.221.205.103;" &
# $! picks up the process number of the command we backgrounded just above
 pid=$!
 sleep 600
 #sleep 90
 kill $pid
 sleep 3
 gzip $file
done

This type of tracing of this VPN session produces about 20 MB of data every 10 minutes. I want to be able to easily share the trace file afterwards in an email. And smaller files will be faster when analyzed in Wireshark.

The script itself I run in the background:
# ./sandy.sh &
And to make sure I don’t get logged out, I just run a slow PING afterwards:
# ping ‐i45 1.1.1.1

Alternate approaches
In retrospect I could have simply used the -ci argument and had the process terminate itself after a certain number of packets were recorded, and saved myself the effort of killing that process. But oh well, it is what it is.

Small tip to see all packets
Turn acceleration off:
fwaccel stat
fwaccel off
fwaccel on (when you’re done).

Conclusion
I share a script I wrote today that is simple, but illustrates several useful concepts.

References and related
fw monitor cheat sheet.

The standard packet analyzer everyone uses is Wireshark from https://wireshark.org/