Categories
Consumer Interest Consumer Tech Network Technologies Raspberry Pi

Consumer Tech: Home Internet stopped working

Intro

We woke up yesterday to no Internet. The usual remedies consumers go through did nothing to resolve the issue. What to do?

The details – November 25, 2020

The usual restarts or my router and the cable modem did not work. I plugged in my work laptop directly to the cable modem for some quick tests but that did not work.

I plugged my work-issued VPN router directly to the cable modem and it did not pick up an IP and re-establish the tunnel.

When I logged into my router I saw that its WAN IP was listed as 0.0.0.0, which means none at all.

I called the ISP twice. Both time they said they could “see” my modem, and they tried to restart it on their end, but that did not seem to do anything at all, based on the constant status LEDs (see picture below). I got my service visit moved up from Dec 11th to Dec 2nd, but still that would mean a week without Internet – not so great when three people are relying on it for their work.

I rebooted the cable modem a couple times at least. Nothing changed.

Then I started some research on quickie alternatives. Ask a friend from work for a spare Cradlepoint air card? They’re already out on vacation. Get a Chinese-made unlocked hotspot with pre-purchased data? Seems fishy, and ultimately expensive. Verizon brand hotspot? We had a borrowed one. Very finicky. And no ethernet ports.

Raspberry Pi + DIY approach?

At one point in the evening, convinced I would have to wait days for for a visit from the cable guy, I rigged up a spare Raspberry Pi to act as a router between a mobile hotspot (a companion tablet to a Verizon phone) and my Linksys router. Why bother? Why not just use the hotspot directly? Mostly because it’s a pain in the rear to reprogram all those Internet of Things devices one has in ones home these days, notably the several Echo Dots, but as well, a wireless printer, a few laptops, Firesticks, tablets, etc. With this approach I keep the WiFi SSID as it was for all those devices. And, it sort of worked! At least I got one Echo Dot to work. I didn’t push my luck. This stuff consumes a lot of data, even when “idle.”

To be continued…

Linksys WRT1200AC status lights – when healthy!
Cable Modem tatus lights – when operating normally

But I am pretty good at troubleshooting. What I know that less experienced people may not is that all the testing I’ve done to that point was not ironclad proof of failure of the cable modem. I know the traditional advice of old is to hook up a laptop directly to the ethernet port and work with it that way. Furthermore the cable company support said that my status lights were reading normally. So, when I tested my work laptop? Are you kidding? That thing has so many problems when I switch between SSIDs due to some new security software – it loves to display the Globe in the system tray, and the only recourse is to reboot. That’s what I was seeing, but notice I said a quickie test? I did not have time to do that reboot and all that. And that work-issued VPN router? I don’t know how that thing really works either. Never having set it up that way I did not trust reading too much into its results (which was essentially an orange status light instead of the usual white).

So when I had more time in the evening, I hooked up a home laptop which I know should work. After a cable modem reboot in fact I did get an IP and could surf the Internet. That was a glimmer of hope. So I put my router back in place. Still it did not pick up an WAN IP address. Still reading 0.0.0.0 for its IP.

Then I put the laptop back, writing down the IP, subnet mask and default gateway. Then I put my router back, switched its WAN mode from DHCP to fixed IP, putting on the exact IP address the laptop had picked up, with correct subnet mask and default gateway. Still it was not working. When the router is not working the WAN status light is sort of orange-ish. It’s white (pictured above) when the WAN link is communicating.

I decided the fault should lie more with my router than anywhere else, and since it wasn’t working and no number of power cycles was changing that situation, I decided that a factory reset is the thing to try. The last thing I could try. I noted the exact name and passwords of my SSIDs, held the reset button for 15 seconds until the status lights flicked out, and let it start up. It went through a start-up process, which i saw after connecting to its default IP of 192.168.1.1. It was clear it was not seeing the cable modem at the point where it should, but it had some very specific advice to try: power off cable modem, wait two minutes, power it back on, and then it would try again. And that did work! Yeah!

What may have precipitated this

My local cable company was recently bought by a much bigger company. I know for a fact what my WAN IP used to be, and I see it has changed. They now draw from a giant pool of IPs – a /14 in CIDR notation – that’s 262,000 addresses – that belongs to the new owner. So I believe the problem occurred due to a poor implementation of the dhcp protocol within my router, or a poor interplay between my router’s DHCP client and the ISP’s DHCP server. But I can’t research that line of troubleshooting because the ISP’s DHCP policies would require a lot of time-consuming experimentation on my part to reverse engineer based on observed behaviour under different conditions. And I would need an open source DHCP client – but I have the Raspberry Pi running dnsmasq for that, so that end could gather all the needed client information.

Prior to this acquisition I would tend to keep the same WAN IP for years – that’s how stable it was.

Another approach

Very germane to this topic is the fact that my neighbor down the street experienced his own Internet outage the day after I did! His solution was to buy a better cable modem. I did not know you could do that – I thought they were proprietary. He also saw his router with the 0.0.0.0 WAN address. And his approach also worked. This makes me less sure my router was really at fault – maybe Altice screwed up their DHCP service for half a day.

Conclusion

Unusual for me, I’m going to write the conclusion before writing the tedious part which is the full explanation in the middle.

By the end of the day I got the Internet working. After isolating the problem to my home router, the Linksys WRT1200AC, and determining that any amount of power cycling was not clearing things up, a factory reset did the trick! The cable modem and my cable Internet service was fine all along.

References and related

How to turn your Raspberry Pi into a router which shares your hotspot with your home router.

The Linksys WRT1200AC is no longer sold. It looks like the newer version is the WRT1900AC – it even looks identical. It’s a good router. I know there are fancier solutions out there, but there are also worse ones as well, so I can only give my qualified endorsement: https://www.amazon.com/Linksys-AC1900-Source-Wireless-WRT1900AC/dp/B014MIBLSA/ref=sr_1_1?dchild=1&keywords=linksys+wrt1200ac&qid=1606519765&sr=8-1

DHCP and CIDR notation are both described in great detail in their respective Wikipedia articles.

Categories
Admin Linux Network Technologies

Configure rsyslog to send syslog to SIEM server running TLS

Intro
You have to dig a little to find out about this somewhat obscure topic. You want to send syslog output, e.g., from the named daemon, to a syslog server with beefed up security, such that it requires the use of TLS so traffic is encrypted. This is how I did that.

The details
This is what worked for me:

...
# DrJ fixes - log local0 to DrJ's dmz syslog server - DrJ 5/6/20
# use local0 for named's query log, but also log locally
# see https://www.linuxquestions.org/questions/linux-server-73/bind-queries-log-to-remote-syslog-server-4175
669371/
# @@ means use TCP
$DefaultNetstreamDriver gtls
$DefaultNetstreamDriverCAFile /etc/ssl/certs/GlobalSign_Root_CA_-_R3.pem
$ActionSendStreamDriver gtls
$ActionSendStreamDriverMode 1
$ActionSendStreamDriverAuthMode anon
 
local0.*                                @@(o)14.17.85.10:6514
#local0.*                               /var/lib/named/query.log
local1.*                                -/var/log/localmessages
#local0.*;local1.*                      -/var/log/localmessages
local2.*;local3.*                       -/var/log/localmessages
local4.*;local5.*                       -/var/log/localmessages
local6.*;local7.*                       -/var/log/localmessages

The above is the important part of my /etc/rsyslog.conf file. The SIEM server is running at IP address 14.17.85.10 on TCP port 6514. It is using a certificate issued by Globalsign. An openssl call confirms this (see references).

Other gothcas
I am running on a SLES 15 server. Although it had rsyslog installed, it did not support tls initially. I was getting a dlopen error. So I figured out I needed to install this module:

rsyslog-module-gtls

References and related
How to find the server’s certificate using openssl. Very briefly, use the openssl s_client submenu.

The rsyslog site itself probably has the complete documentation. Though I haven’t looked at it thoroughly, it seems very complete.

Categories
Admin JavaScript Network Technologies

Practical Zabbix examples

Intro
I share some Zabbix items I’ve had to create which I find useful.

Convert DateAndTime SNMP output to human-readable format

Of course this is not very Zabbix-specific, as long as you realize that Zabbix produces the outer skin of the function:

function (value) {
// DrJ 2020-05-04
// see https://support.zabbix.com/browse/ZBXNEXT-3899 for SNMP DateAndTime format
'use strict';
//var str = "07 E4 05 04 0C 32 0F 00 2B 00 00";
var str = value;
// alert("str: " + str);
// read values are hex
var y256 = str.slice(0,2); var y = str.slice(3,5); var m = str.slice(6,8); 
var d = str.slice(9,11); var h = str.slice(12,14); var min = str.slice(15,17);
// convert to decimal
var y256Base10 = +("0x" + y256);
// convert to decimal
var yBase10 = +("0x" + y);
var Year = 256*y256Base10 + yBase10;
//  alert("Year: " + Year);
var mBase10 = +("0x" + m);
var dBase10 = +("0x" + d);
var hBase10 = +("0x" + h);
var minBase10 = +("0x" + min);
var YR = String(Year); var MM = String(mBase10); var DD = String(dBase10);
var HH = String(hBase10);
var MIN = String(minBase10);
// padding
if (mBase10 < 10)  MM = "0" + MM; if (dBase10 < 10) DD = "0" + DD;
if (hBase10 < 10) HH = "0" + HH; if (minBase10 < 10) MIN = "0" + MIN;
var Date = YR + "-" + MM + "-" + DD + " " + HH + ":" + MIN;
return Date;

I put that javascript into the preprocessing step of a dependent item, of course.

All my real-life examples do not fill in the last two fields: +/-, UTC offset. So in my case the times must be local times. But consequently I have no idea how a + or – would be represented in HEX! So I just ignored those last fields in the SNNMP DateAndTime which otherwise might have been useful.

Here’s an alternative version which calculates how long its been in hours since the last AV signature update.

// DrJ 2020-05-05
// see https://support.zabbix.com/browse/ZBXNEXT-3899 for SNMP DateAndTime format
'use strict';
//var str = "07 E4 05 04 0C 32 0F 00 2B 00 00";
var Start = new Date();
var str = value;
// alert("str: " + str);
// read values are hex
var y256 = str.slice(0,2); var y = str.slice(3,5); var m = str.slice(6,8); var d = str.slice(9,11); var h = str.slice(12,14); var min = str.slice(15,17);
// convert to decimal
var y256Base10 = +("0x" + y256);
// convert to decimal
var yBase10 = +("0x" + y);
var Year = 256*y256Base10 + yBase10;
//  alert("Year: " + Year);
var mBase10 = +("0x" + m);
var dBase10 = +("0x" + d);
var hBase10 = +("0x" + h);
var minBase10 = +("0x" + min);
var YR = String(Year); var MM = String(mBase10); var DD = String(dBase10);
var HH = String(hBase10);
var MIN = String(minBase10);
var Sigdate = new Date(Year, mBase10 - 1, dBase10,hBase10,minBase10);
//difference in hours
var difference = Math.trunc((Start - Sigdate)/1000/3600);
return difference;

Calculated bandwidth from an interface that only provides byte count
Again in this example the assumption is you have an item, probably from SNMP, that lists the total inbound/outbound byte count of a network interface – hopefully stored as a 64-bit number to avoid frequent rollovers. But the quantity that really excites you is bandwidth, such as megabits per second.

Use a calculated item as in this example for Bluecoat ProxySG:

change(sgProxyInBytesCount)*8/1000000/300

Give it type numeric, Units of mbps. sgProxyInBytesCount is the key for an SNMP monitor that uses OID

IF-MIB::ifHCInOctets.{$INTERFACE_TO_MEASURE}

where {$INTERFACE_TO_MEASURE} is a macro set for each proxy with the SNMP-reported interface number that we want to pull the statistics for.

The 300 in the denominator of the calculated item is required for me because my item is run every five minutes.

Alternative
No one really cares about the actual total value of byte count, right? So just re-purpose the In Bytes Count item a bit as follows:

  • add preprocessing step: Change per second
  • add second preprocessing step, Custom multiplier 8e-6

The first step gives you units of bytes/second which is less interesting than mbps, which is given by the second step. So the final units are mbps.

Be sure to put the units as !mbps into the Zabbix item, or else you may wind up with funny things like Kmbps in your graphs!

Creating a baseline

Even as of Zabbix v 5, there is no built-in baseline item type, which kind of sucks. Baseline can mean many different things to many people – it really depends on the data. In the corporate world, where I’m looking at bandwidth, my data has these distinct characteristics:

  • varies by hour-of-day, e.g., mornings see heavier usage than afternoons
  • there is the “Friday effect” where somewhat less usage is seen on Fridays, and extremely less usage occurs on weekends, hence variability by day-of-week
  • probably varies by day of month, e.g., month-end closings

So for this type of data (except the last criterion) I have created an appropriate baseline. Note I would do something different if I were graphing something like the solar generation from my solar panels, where the day-of-week variability does not exist.

Getting to the point, I have created a rolling lookback item. This needs to be created as a Zabbix Item of type Calculated. The formula is as follows:

(last(sgProxyInBytesCount,#1,1w)+
last(sgProxyInBytesCount,#1,2w)+
last(sgProxyInBytesCount,#1,3w)+
last(sgProxyInBytesCount,#1,4w)+
last(sgProxyInBytesCount,#1,5w)+
last(sgProxyInBytesCount,#1,6w))/6

In this example sgProxyInBytesCount is my key from the reference item. Breaking it down, it does a rolling lookback of the last six measurements taken at this time of day on this day of the week over the last six weeks and averages them. Voila, baseline! The more weeks you include the more likely you are to include data you’d rather not like holidays, days when things were busted, etc. I’d like to have a baseline that is from a fixed time, like “all of last year.” I have no idea how. I actually don’t think it’s possible.

But, anyway, the baseline approach above should generally work for any numeric item.

Refinement

The above approach only gives you six measurements, hence 1/sqrt(6) ~ 40% standard deviation by the law of large numbers, which is still pretty jittery as it turns out. So I came up with this refined approach which includes 72 measurements, hence 1/sqrt(72) ~ 12% st dev. I find that to be closer to what you intuitively expect in a baseline – a smooth approximation of the past. Here is the refined function:

(avg(sgProxyInBytesCount,1h,1w)+
avg(sgProxyInBytesCount,1h,2w)+
avg(sgProxyInBytesCount,1h,3w)+
avg(sgProxyInBytesCount,1h,4w)+
avg(sgProxyInBytesCount,1h,5w)+
avg(sgProxyInBytesCount,1h,6w))/6

I would have preferred a one-hour interval centered around one week ago, etc., e.g., something like 1w+30m, but such date arithmetic does not seem possible in Zabbix functions. And, yeah, I could put 84600s (i.e., 86400 – 1800), but that is much less meaingful and so harder to maintain. Here is a three-hour graph whose first half still reflects the original (jittery) baseline, and last half the refined function.

Latter part has smoothed baseline in light green

What I do not have mastered is whether we can easily use a proper smoothing function. It does not seem to be a built-in offering of Zabbix. Perhaps it could be faked by a combination of pre-processing and Javascript? I simply don’t know, and it’s more than I wish to tackle for the moment.

Data gap between mulitple item measurements looks terrible in Dashboard graph – solution

In a Dashboard if you are graphing items which were not all measured at the same time, the results can be frustrating. For instance, an item and its baseline as calculated above. The central part of the graph will look fine, but at either end giant sections will be missing when the timescale of display is 30 minutes or 60 minutes for items measured every five minutes or so. Here’s an example before I got it totally fixed.

Zabbix item timing mismatch

See the left side – how it’s broken up? I had beguin my fix so the right side is OK.

The data gap solution

Use Scheduling Intervals in defining the items. Say you want a measurement every five minutes. Then make your scheduling interval m/5 in all the items you are putting on the same graph. For good measure, make the regular interval value infrequent. I use a macro {$UPDATE_LONG}. What this does is force Zabbix to measure all the items at the same time, in this case every five minutes on minutes divisible by five. Once I did that my incoming bandwith item and its corresponding baseline item aligned nicely.

Low-level Discovery

I cottoned on to the utility of this part of Zabbix a little late. Hey, slow learner, but I eventually got there. What I found in my F5 devices is that using SNMP to monitor the /var filesystem was a snap: it was always device 32 (final OID digit). But /var/log monitoring? Not so much. Every device seemed different, with no obvious pattern. Active and standby units – identical hardware – and some would be 53, the partner 55. Then I rebooted a device and its number changed! So, clearly, dynamically assigned and no way was I going to keep up with it. I had learned the numbers by doing an snmpwalk. The solution to this dynamically changing OID number is to use low-level discovery.

Conclusion
A couple of really useful but poorly documented items are shared. Perhaps more will be added in the future.


References and related

https://support.zabbix.com/browse/ZBXNEXT-3899 for SNMP DateAndTime format

My first Zabbix post was mostly documentation of a series of disasters and unfinished business.

Blog post about calculated items by a true expert: https://blog.zabbix.com/zabbix-monitoring-with-calculated-items-explained/9950/

Low-level Discovery write-up: https://blog.zabbix.com/how-to-use-zabbix-low-level-discovery/9993/

Categories
Admin Firewall Network Technologies Security

Linux shell script to cut a packet trace every 10 minutes on Checkpoint firewall

Intro
Scripts are normally not worth sharing because they are so easy to construct. This one illustrates several different concepts so may be of interest to someone else besides myself:

  • packet trace utility in Checkpoint firewall Gaia
  • send Ctrl-C interrupt to a process which has been run in the background
  • giving unqieu filenames for each cut
  • general approach to tacklnig the challenge of breaking a potentially large output into manageable chunks

The script
I wanted to learn about unexpected VPN client disconnects that a user, Sandy, was experiencing. Her external IP is 99.221.205.103.

while /bin/true; do
# date +%H%M inserts the current Hour (HH) and minute (MM).
 file=/tmp/sandy`date +%H%M`.cap
# fw monitor is better than tcpdump because it looks at all interfaces
 fw monitor -o $file -l 60 -e "accept src=99.221.205.103 or dst=99.221.205.103;" &
# $! picks up the process number of the command we backgrounded just above
 pid=$!
 sleep 600
 #sleep 90
 kill $pid
 sleep 3
 gzip $file
done

This type of tracing of this VPN session produces about 20 MB of data every 10 minutes. I want to be able to easily share the trace file afterwards in an email. And smaller files will be faster when analyzed in Wireshark.

The script itself I run in the background:
# ./sandy.sh &
And to make sure I don’t get logged out, I just run a slow PING afterwards:
# ping ‐i45 1.1.1.1

Alternate approaches
In retrospect I could have simply used the -ci argument and had the process terminate itself after a certain number of packets were recorded, and saved myself the effort of killing that process. But oh well, it is what it is.

Small tip to see all packets
Turn acceleration off:
fwaccel stat
fwaccel off
fwaccel on (when you’re done).

Conclusion
I share a script I wrote today that is simple, but illustrates several useful concepts.

References and related
fw monitor cheat sheet.

The standard packet analyzer everyone uses is Wireshark from https://wireshark.org/

Categories
Network Technologies

What is BGP hijacking?

Intro
I just learned of this really clear explanation of BGP hijacking, including interactive links, and what can be done to improve the current situation, namely, implement RPKI. I guess this weighs on me lately after I learned about another massive BGP hijacking out of Russia. See the references for these links.

References and related
This is the post I was referring to above, very interactive and not too technical. Is BGP Safe Yet?

A more in-depth disucssion ow aht RPKI is all about is on this Cloudflare blog: https://blog.cloudflare.com/rpki/

A couple descriptions of the latest BGP hijack out of Russia: https://www.zdnet.com/article/russian-telco-hijacks-internet-traffic-for-google-aws-cloudflare-and-others/ and https://blog.qrator.net/en/how-you-deal-route-leaks_69/

Categories
Admin Network Technologies Web Site Technologies

Examining certificates over explicit proxy with openssl

Intro
This is pretty esoteric, but I’ve personally been waiting for this for a long time. It seems that beginning with openssl 1.1, the s_client sub-menu has had support for a proxy setting. Until then it was basically impossible to examine the certificates your proxy was sending back to users.

The syntax is something like:

openssl s_client -proxy <proxy_ip>:<proxy_port> -servername expired.badssl.com -showcerts -connect expired.badssl.com:443

where the proxy is a standard HTTP proxy.

Why is it a great thing? If your proxy does SSL interception then it is interfering with with the site’s normal certificate. And worse, it can good. What if its own signing certificate has expired?? I’ve seen it happen, and it isn’t pretty…

To find the openssl version just run openssl version.

My SLES12 SP4 servers have a version which is too old. My Cygwin install is OK, actually. My Redhat 7.7 has a version which is too old. I do have a SLES 15 server which has a good version. But even the version on my F5 devices is too old, surprisingly.

References and related
the openssl project home page: https://www.openssl.org/

A few of my favorite openssl commands.

Categories
Admin Linux Network Technologies

Quick Tip: Powershell command to unblock a firewall port when running Windows Defender

Setup
I decided to run an X Server on my Windows 10 laptop. I only need it for Cognos gateway configuration, but when you need it, you need it. Of course an X Server listens on port 6000, so hosts outside of your PC have to be able to initiate a TCP connection to your PC with destination port 6000. So that port has to be open. The software I use for the X Server is Mobatek XTerm.

Here is the Powershell command to disable the block of TCP port 6000.

New-NetFirewallRule -DisplayName "MobaXterm Allow Incoming Requests" -Direction Inbound -LocalPort 6000 -Protocol TCP -Profile Domain -Action Allow

The Powershell window needs to be run as administrator. The change is permanent: it suffices to run it once.

Conclusion
And, because inquiring minds want to know, did it work? Yes, it worked and I could send my cogconfig X window to my Mobatek X Server. I had to look for a new Window. It was slow.

Categories
Linux Network Technologies

Network utilities for Windows

Intro
Today I came across a simple but useful tool which runs on Windows systems that will help determine if a remote host is listening on a particular port. I wanted to share that information.

The details
PortQry is attractive because of its simplicity, plus, it is supported and distributed by Microsoft themselves. The help section reads like this:

PortQry version 2.0
 
Displays the state of TCP and UDP ports
 
 
Command line mode:  portqry -n name_to_query [-options]
Interactive mode:   portqry -i [-n name_to_query] [-options]
Local Mode:         portqry -local | -wpid pid| -wport port [-options]
 
Command line mode:
 
portqry -n name_to_query [-p protocol] [-e || -r || -o endpoint(s)] [-q]
        [-l logfile] [-sp source_port] [-sl] [-cn SNMP community name]
 
Command line mode options explained:
        -n [name_to_query] IP address or name of system to query
        -p [protocol] TCP or UDP or BOTH (default is TCP)
        -e [endpoint] single port to query (valid range: 1-65535)
        -r [end point range] range of ports to query (start:end)
        -o [end point order] range of ports to query in an order (x,y,z)
        -l [logfile] name of text log file to create
        -y overwrites existing text log file without prompting
        -sp [source port] initial source port to use for query
        -sl 'slow link delay' waits longer for UDP replies from remote systems
        -nr by-passes default IP address-to-name resolution
            ignored unless an IP address is specified after -n
        -cn specifies SNMP community name for query
            ignored unless querying an SNMP port
            must be delimited with !
        -q 'quiet' operation runs with no output
           returns 0 if port is listening
           returns 1 if port is not listening
           returns 2 if port is listening or filtered
 
Notes:  PortQry runs on Windows 2000 and later systems
        Defaults: TCP, port 80, no log file, slow link delay off
        Hit Ctrl-c to terminate prematurely
 
examples:
portqry -n myserver.com -e 25
portqry -n 10.0.0.1 -e 53 -p UDP -i
portqry -n host1.dev.reskit.com -r 21:445
portqry -n 10.0.0.1 -o 25,445,1024 -p both -sp 53
portqry -n host2 -cn !my community name! -e 161 -p udp
...

The PortQry “install” consisted of unzipping a ZIP file, so, no install at all, and no special permissions needed, which is a plus in my book.

nmap
Of course there is always nmap. I never really got into it so much, but clearly you can go nuts with it. One advantage is that it is available on linux and MacOS as well. But in my opinion it is a heavy-handed install.

References and related
PortQry

nmap

Some nmap examples I have used.

Categories
Admin Network Technologies

Monitoring by Zabbix: a working document

Intro
I panned Zabbix in this post: DIY monitoring. But I have compelling reasons to revisit it. I have to say it has matured, but there remain some very frustrating things about it, especially when compared with SiteScope (now owned by Microfocus) which is so much more intuitive.

But I am impressed by the breadth of the user base and the documentation. But learning how to do any specific thing is still an exercise in futility.

I am going to try to structure this post as a problems encountered, and how they were resolved.

Current production version as of this writing?
Answer: 5.0

Zabbix Manual does not work in Firefox
That’s right. I can’t even read the manual in my version of Firefox. Its sections do not expand. Solution: use Chrome

Which database?
You may see references to MYSQL in Zabbix docs. MYSQL is basically dead. what should you do?

Zabbix quick install on Redhat

Answer
Install mariadb which has replaced MYSQL and supports the same commands such as the mysql from the screenshot. On my Redhat instance I have installed these mariadb-related repositories:

mariadb-5.5.64-1.el7.x86_64
mariadb-server-5.5.64-1.el7.x86_64
mariadb-libs-5.5.64-1.el7.x86_64

Terminology confusion
what is a host, a host group, a template, an item, a web scenario, a trigger, a media type?
Answer

Don’t ask me. When I make progress I’ll post it here.

Web scenario specific issues
Can different web scenarios use different proxies?
Answer: Yes, no problem. In really old versions this was not possible. See web scenario screenshot below.

Can the proxy be a variable so that the same web scenario can be used for different proxies?
Answer: Yes. Let’s say you attach a web scenario to a host. In that host’s configuration you can define a “macro” which sets the variable value. e.g., the value of HTTP_PROXY in my example. I think you can do the same from a template, but I’m getting ahead of myself.

Similarly, can you do basic proxy auth and hide the credentials in a MACRO? Answer: I think so. I did it once at any rate. See above screenshot.

Why does my google.com web scenario work whereas my amazon.com scenario not when they’re exactly the same except for the URL?
Answer: some ideas, but the logging information is bad. Amazon does not take to bots hitting it for health check reasons. It may work better to change the agent type to Linux|Chrome, which is what I am trying now. Here’s my original answer: Even with command-line curl I get an error through this proxy. That can’t be good:
$ curl ‐vikL www.amazon.com

...
NSS error -5961 (PR_CONNECT_RESET_ERROR)
* TCP connection reset by peer
* Closing connection 1
curl: (35) TCP connection reset by peer


My amazon.com web scenario is not working (status of 1), yet in dashboard does not return any obvious warning or error or red color. Why? Answer:
no idea. Maybe you have to define a trigger?

Say you’re on the Monitoring|latest data screen. Does the data get auto-updated? Answer: yes, it seems to refresh every 30 seconds.

Why is the official FAQ so useless? Answer: no idea how a piece of software otherwise so feature-rich could have such a useless FAQ.

Zabbix costs nothing. Is it still actively supported? Answer: it seems very actively supported for some reason. Not sure what the revenue model is, however…

Can I force one or more web scenarios to be run immediately? I do this all the time in SiteScope. Answer: I guess not. There is no obvious way.

Suppose you have defined an item. what is the item key? Answer: You define it. Best to make it unique and use contiguous characters. I’m seeing it’s very important…

What is the equivalent to SiteScope’s script monitors? Answer: Either ssh check or external check.

How would you set up a simple PING monitor, i.e., to see if your host is up? Answer: Create an item as a “simple check”, e.g., with the name ping this host, and the key icmpping[{HOST.IP},3]. That can go into a template, by the way. If it succeeded it will return a 1.

I’ve made an error in my script for an external check. Why does Latest data show nothing at all? Answer: no idea. If the error is bad enough Zabbix will disable the item on you, so it’s not really running any longer. But even when it doesn’t do that, a lot of times I simply see no output whatsoever. Very frustrating.

Help! The Latest Data graph’s Y axis only shows 0’s and 5’s. Answer: Another wonderful Zabbix feature, this happens because your Units are too long. Even “per minute” as Units can get you into trouble if it is trying to draw a Y axis with values 22.0 22.5 23.0, etc: you’ll only see the .0’s and the .5’s. Change units to a maximum of seven characters such as “per min.”

Why is the output from an ssh check truncated, where does the rest go? Answer: no idea.

How do you increase the information contained in the zabbix server log? Let’s say your zabbix server is running normally. Then run this command: zabbix_server ‐R log_level_increase
You can run it multiple times to keep increasing the verbosity (log level), I think.

Attempting to use ssh items with key authentication fails with :”Public key authentication failed: Callback returned error” Initially I thought Zabbix was broken with regards to ssh public key authentication. I can get it to work with password. I can use my public/private key to authenticate by hand from command-line as root. Turns out running command such as sudo -u zabbix ssh … showed that my zabbix account did not have permissions to write to its home directory (which did not even exist). I guess this is a case of RTFM, because they do go over all those steps in the manual. I fixed up permissions and now it works for me, yeah.

Where should the scripts for external checks go? In my install it is /usr/lib/zabbix/externalscripts.

Why is the behaviour of triggers inconsistent. sometimes the same trigger has expected behaviour, sometimes not. Answer: No idea. Very frustrating. See more on that topic below.

How do you force a web scenario check when you are using templates? Answer: No idea.

Why do (resolved) Problems disappear no matter how you search for them if they are older than, say, 30 minutes? Answer: No idea. Just another stupid feature I guess.

Why does it say No media defined for user even though user has been set up with email as his media? Answer: no idea.

Why do too many errors disable an ssh check so that you get Status Disabled and have no graceful way to recover? Answer: no idea. It makes sense that Zabbix should not subject itself to too many consecutive errors. But once you’ve fixed the underlying problem the only recovery I can figure is to delete the item and recreate it. or delete the host and re-create it. Not cool.

I heard dependent items are the way to go to parse complex data coming out of a rich text item. How do you do that? Answer: Yes they are. I have gotten them to work and really give me the fine-grained control I’ve always wanted. I hope to show a real-life example soon. To get started creating a dependent item you can right-click on the dots of an item, or create a new item and choose type Dependent Item.

I am looking at Latest data and one item is grayed out and has no data. Why? Answer: almost no idea. This happens to me in a dependent item formed by a regular expression where the regular expression does not match the content. I am trying to make my RegEx more flexible to match both good and error conditions.

Why do my dependent items, when running a Check Now, say Cannot send request: wrong data type, yet they are producing data just fine when viewed through Latest data? Answer: this happens if you ran a Check Now on your template rather than when viewing an individual host. Make sure you select a host before you run Check Now. Actually, even still it does not work, so final answer: no idea.

Why do some regular expressions check out just fine on regex101.com yet produce a match for value of type “string”: pattern does not match error in Zabbix? Answer: Some idea. Fancy regular expressions do not seem to work for some reason.

Every time I add an item it takes the absolute maximum amount of time before I see data, whether or not I run check Now until I turn blue in the face. Why? Answer: no idea. Very frustrating.

If the Zabbix server is in one time zone and I am in another, can I have my view of timestamps customized to my time zone? Otherwise I see all times in the timezone of the Zabbix server. Answer: You are out of luck. The suggestion is to run two GUIs, one in your time zone. But there is a but. Support for this has been announced for v 5.20. Stay tuned…

My DNS queries using net.dns don’t do anything. Why? Answer: no idea. Maybe your host is not running an actual Zabbiox agent? That’ll do it. Forget that net.dns check if you can’t install an agent. Zabbix has no agentless DNS monitor for some strange reason.

A DNS query which returns many address records fails (such as querying an AD domain), though occasionally succeeds. Why? Answer: So your key looks something like this, right? net.dns.record[10.1.2.3,my-AD-domain.net,A,10,2,tcp]. And when you do the query through dig it works fine, right? E.g., dig +tcp my-AD-domain.net @10.1.2.3. And you’ve set the Zabbix response to type text? It seems to be just another Zabbix bug. You may have to use a script instead. Zabbix support has been able to reproduce this bug and they are working on it as we speak.

What does Check/Execute Now really do? Answer: essentially nothing as far as I tell. It certainly doesn’t “check now”, i.e., force the item to be run. However, if you have enough permissions, what you can do when you’re looking at an item for a specific Host is to run Test. Then Get Value. I usually get Permission Denied, however.

I want to show multiple things on a dashboard widget graph like an item plus its baseline (Ed: see references for calculating a baseline). What’s the best way? Answer: You can use the add new data set feature for instance to add your baseline. In your additional data set you put your baselines. Then I like to make the width 2, transparency 0 and fill 0. This will turn it into a thin bold line with a complementary color while not messing too much with the original colors of your items. The interface is squirrely, but, hey, it’s Zabbix, what did you expect?

I have a lot of hosts I want to add to a template. Does that Mass Update feature actually work? Answer: yes. Use it. It will save you time.

Help! I accidentally deleted an entire template. I meant to just delete one of its macros. Is there a revert? Answer: it doesn’t look like it. Hope you remember what you did…

It seems if I choose units in an item which have too many characters, e.g., client connections, the graph (in Latest Data) cuts it off and doesn’t even display the scale? Answer: seems so. It’s a bug. This won’t happen when using vector graphs in Dashboard. The graphs in Latest Data are PNG and limited to short Units, e.g., mbps. Changing to vector graphs has been in the roadmap but then disappeared.

Can I create a baseline? Nope. It’s on the roadmap. However, see this clever idea for building one on your own without too much effort.

I’ve put a few things on the same Dashboard graph. Why don’t they align? There are these big gaps. Zabbix runs the items when it feels like, and the result is gaps in data which Zabbix makes no attempt to conceal at the beginning and end of a graph. You can use Scheduling Intervals on your items to gain some control over this. See this article for details.

Besides cloning the whole thing, how can I change the name of a Dashboard? Answer: If you just click to edit a Dashboard the name appears fixed. However, click on the gear icon and that gives you the option to edit the dashboard name. It’s kind of an undocumented feature.

My SNMP MIB has bytes in/out for an interface when what I really want is bandwidth, i.e., Megabits per second. A little preprocessing on a 64-bit bytes value and you are there (32 bit values may roll over too frequently). See this article for details.

In functions like avg (sec|#num,<time_shift>), why is the time_shift argument so restricted? It can’t be a macro, contain a formula like 1w-30m, or anything semi-sophisticated. It just accepts a dumb literal like 5h? Answer: It’s just another shortcoming in Zabbix. How much did you pay for it? 🙂

I have an SNMP template with items for a hostgroup of dispersed servers. Some work fine. The one in Asia returns a few values, but not all. I am using Bulk Request. Answer (to your implied question!) You must have bad performance to that one. Use a Zabbix proxy with a longer timeout for SNMP requests. I was in that situation and that worked for me.

Riddle: when is 80% not 80%? Answer: when pulling in used storage on a filesystem via SNMP and comparing it to storage size! I had carefully gotten a filesystem 83% full based on the output of df -m. But my trigger, set to go off at 80%, never went off. How could it be? The 83% includes some kind of reserved user space on the filesystem which is not included when you do the calculation directly. So I was at 78% or so in actuality. I changed the trigger to 75%.

My trigger for a DNS item, which relies on a simple diff(), goes off from time-to-time yet the response is the same. Why? Answer: We have seen this behavior for a CNAME DNS item. The response changed the case of the returned FQDN from time-to-time, and that is enough to set off the Zabbix diff()-based trigger! We pre-processed the output with a RegEx to just get the bits we wanted to examine to fix this.

Related question. My diff() trigger for a DNS item does NOT go off when the server actually goes down. What’s up with that? Answer: Although you might expect a suddenly unavailable server constitutes a “difference,” in Zabbix’s contorted view of reality it does not. I recommend an additional trigger using the function nodata().

Does the new feature of login using SAML actually work? Answer: Yes, we are using it in Zabbix v 5.0.

My OIDs for my filesystems keeps shifting around. What to do? Answer: Use low-level discovery. It’s yet another layer of abstraction and confusion, but it’s probably worth it. I intend to write up my approach in my practical Zabbix examples blog post.

After an Zabbix agent item goes bad (no data), Zabbix refuses to test it for a full 30 minutes after it went bad, despite an update interval of 5 minutes. Why? Answer: In one of the worst architectural decisions of all time, Zabbix created the concept of unsupported items. It works something like this: the very moment when you need to be told Hey there’s something wrong here is when Zabbix goes quiet. Your item became unsupported, which is like being in the penalty box for 30 minutes, during which time nothing works like you naively expected it to. Even the fact that your item became unsupported is almost impossible to find out from a trigger. An example of software which treats this situation correctly is Microfocus SiteScope. In Zabbix in version 5.0 there’s a global timeout for all unsupported items. Ours is set to 30 minutes, you see. In some cases that may make sense and prevent Zabbix from consuming too many resources trying to measure things which don’t work. I find it annoying. For DNS, specifically, best to use a key of type net.dns and not net.dns.record. That returns a simple 0 or 1 and does not become unsupported if the dns server can’t be reached. V 5.2 will provide some more options around this issue. For a HTTP agent and I suppose many other items, it’s best to create triggers which use the nodata() function,  which can somewhat compensate for this glaring weakness in Zabbix.

I wish to use a DNS value instead of an IP in net.tcp.service[service,IP,port] because I use geoDNS or round-robin DNS. Can I? Answer: It seems to work, yes.

Can I send alerts to MS teams? Answer: This is obviously a fake question. But the answer is Yes. You set up a Connector in a MS Teams channel. It’s pretty staight forward and it’s pretty cool. I’ll try to publish more in my Zabbix tips post if I have time.

Get a lot of false positives? Answer: Yes! On F5 equipment this one is vexing me:

Resolved: BIG-IP is unreachable via SNMP for 15 minutes

And for others (pool member unavailable for a few minutes) I tried to require two consecutive failures before sending an alert. Basically still working on it.

A word about SSH checks and triggers
Through the school of hard knocks I have learned that my ssh check is clipping the output from the executed command. So you know that partial data you see when you look at latest data, and thought it was truncating it for display purposes? Nuh, ah. That’s all you’re getting to go up against in your trigger, which sucks. It’s something like 260 characters. I got lucky in a sense to discover this early by running an ssh check against dns resolution of amazon.com. The response I got varied almost every 60 seconds depending on whether or not the response came out of the dns cache. So this was an excellent testbed to learn about the flakiness of triggers as well as waste an entire day.

Another thing about triggers with a regex. As far as I can tell the logic is reversed. So you think you’re defining the OK condition when you seek to match the output and have it given the value of 1. But instead try to match the desired output for the OK condition, but assign it a value of 0. I guess. Only that approach seems to work for me. And getting the regex to treat multiple lines as a unit was also a little tricky. I think by default it favored testing only against the last line.

So let’s say my output as scraped from Monitoring|Latest Data alternated between either

proxy1&gt;test dns amazon.com
Performing DNS lookup for: amazon.com
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  205.251.242.103
  176.32.98.166
  176.32.103.205
Cache TTL: 1, cache HIT
DNS Resolver Response: Success

or

proxy1&gt;test dns amazon.com
Performing DNS lookup for: amazon.com
 
Sending A query for amazon.com to 192.168.135.145.
 
Sending A query for amazon.com to 8.8.8.8.
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  20

, then here is my iregexp expression which seems to do the correct thing (treat both of these outcomes as successes):

{proxy1:ssh.run[resolve DNS,1.2.3.4,22,utf-8].iregexp("(?s)((205\.251\.|176\.32\.)|Sending A query.+\s20)")}=0

Note that the (?s) at the beginning helps, I think, to treat the newline character as just another character which matches “.”. I may have an extra set of parentheses around the outermost alternating expression, but I can only experiment so much…

I ran various tests such as to change just one of the numbers to make sure it triggered.

I now think I will get better, i.e., more complete, results if I make the item of type text rather than character, at least that switch definitely helped with another truncated output I was getting from another ssh check. So, yes, now I am capturing all the output. So, note to self, use type text unless you have really brief output from your ssh check.

So with all that gained knowledge, my simplified expression now reads like this:

{proxy:ssh.run[resolve a dns name,1.2.3.4,22,utf-8].iregexp("(205\.251\.|176\.32\.)")}=0

Here’s a CPU trigger. From a show status it focuses on the line:

CPU utilization: 29%

and so if I want to trigger a problem for 95% or higher CPU, this expression works for me:

CPU utilization:\s+([ 1-8]\d|9[0-4])\%

A nice online regular expression checker is https://regexr.com/

And a very simple PING test ssh check item, where the expected resulting line will be:

5 packets transmitted, 5 packets received, 0% packet loss

– for that I used the item wizard, altered what it came up with, and arrived at this:

(({proxy:ssh.run[ping 8.8.8.8,1.2.3.4,22,utf-8].iregexp("[45] packets received")})=0)

So I will accept the results as OK as long as at most one of five packets was dropped.

A lesson learned from SNMP monitoring of F5 devices
My F5 BigIP devices began producing problems as soon as we set up the SNMP monitoring. Something like this:

Node /Common/drj-10_1_2_3 is not available in some capacity: blue (4)

It never seemed to matter until now that my nodes appear blue. But perhaps SNMP is enforcing a best practice and expecting nodes to not be blue, meaning to be monitored. And it turns out you can set up a default monitor for your nodes (I use gateway_icmp). It’s found in Nodes | Default Monitor. I’m not sure why this is not better documented by F5. After this, many legacy nodes turn red so I am cleaning them up… But my conclusion is that I have learned something about my own systems from the act of implementing this monitoring, and that’s a good thing.

To be continued…

References and related
A good commercial solution for infrastructure monitoring: Microfocus SiteScope.

DIY monitoring

The Zabbix manual

A nice online regular expression (RegEx) checker is: https://RegEx101.com/.

Another online regular expression checker is: https://RegExr.com/.

Just to put it out there: If you like Zabbix you may also like Specto. Specto is an open-source tool for monitoring web sites (“synthetic” monitoring). I know one major organization which uses it so it can’t be too bad. https://specto.sourceforge.net/

Since this document is such a mess I’m starting to document some of my interesting items and Zabbix tips in this newer and cleaner post. It includes the baseline calculation formula.

Categories
Admin Network Technologies

The IT Detective Agency: the case of the Is it the firewall? or routing? or switch? or layer 2?

Intro
This is yet another tale of things in the IT world often do not turn out the way it seems at first blush. Or possibly a tale of just when you think you’ve seen it all after decades in the industry, something new (to you) occurs.

What’s going on
The firewall team was all busy so when this strange problem occurred Friday they called in the second string: me. I consider some of the team to be less-than-customer focused so I try to compensate for them and for my lack of knowledge about the firewall by applying a more customer-first attitude. In other words, a sympathetic listening ear. These days it can be hard just to find someone to complain to about your It problem, and I am keenly aware of that.

There was some strange communication which wasn’t working, mediated by a firewall I had never accessed and was not sure i even had access to. So of course I was asked to join a big conference call where an ongoing debugging session was taking place.

I refused.

I hate being blindsided, and i hate not having answers, making me sound even less competent than I already am.

But what I did do is being my research to see what the system is, if perhaps I had access, etc.

Yes. I found that through a management system I have access to I had access to view the policies on that particular firewall and view the logs as well.

So once I had that up, I agreed to join the call.

They had one server communicating to three different systems. Only one of the three systems was being reached. Yes the other two were on the same subnet. Two of our firewalls were between the system and the three servers.

And, yes, i could see some drops. The interesting TCP error stated: TCP packet out of state, first packet isn’t SYN.

No problem. routing must be screwed up such that we have asymmetric routing. It happens all the time. Right? well these systems are really appliances with only some basic networking information configurable, not real debugging facility, and really no ability to add a host route.

I could not establish a shell session onto the firewall – not sure what the password naming scheme was that they used.

Then a real firewall guy comes on the call. But his connectivity is messed up, so I keep with the debug session, if nothing else than to support him since four eyes is more effective than just two. He shares the routing tables of our two in-line firewalls. It’s hard to understand as these are all new subnets for me, some are ones that don’t look right. But just focusing on possible host routes for any of these three servers, I don’t see anything amiss.

Firewall policy
And, in firewall policy I see the entire subnet has this traffic permitted. There is no rule specific to one or the other of these systems.

So what do we have up until now?
A purist firewall administrator attitude would be as follows:
The firewall treats all these systems the same, therefore this cannot be a firewall problem. Talk to your networking or system people. Have a nice day.

Well, in fact there was some serious question about the network switch as well. So we had a network guy on the call. So they dug up the MAC addresses of these systems, from which they found the switch ports. Then they checked the port configuration. Ah, some complex 802.1x authentication was configured. As I understand this means the device would not even be allowed onto the subnet until it passed some kind of Radius authentication. So they removed this 802.1x stuff and just made sure that port was assigned to the right vlan.

Still, the problem persisted.

I think the other firewall guy was also new to this equipment. Eventually, though, he tries to do a packet trace of the one that’s working versus the one that isn’t.

You know, I never saw the results of those traces, but I’m pretty sure, reading between the lines, that they surprised him, meaning, they did not fit the hypothesis of the asymmetric routing.

In these situations there is the main communication in the mian session, then side communications going on, like between me and the firewall guy. But it is all chaotic. Acoustics are mediocre, accents are hard to understand. So the net transfer of information is pretty low. Statements, even important ones, often have to be repeated multiple times (rebroadcasts) to assure everyone “gets it.”

Typical questions were asked. When did this last work? what had changed? There were a couple changes. Some kind of networking thing (I forget what), and then the firewalls changed management systems after that. The firewall change seemed closer in time to the last known success.

You acquire more and more information as you dig into problems. It’s hard to judge which is relevant at the time and which lines if inquiry are a complete waste of time. A good incident manager or project manager can sense which are the more productive lines of investigation and nurture those discussions while suppressing the noise.

Actually it was the networking guy who found the Checkpoint link below. I looked at it. the firewall guy was muttering something about badly behaved, older applications that might exhibit this behaviour.

So we agreed to take the suggested steps, which would basically allow these out of state packets. Drat. The firewall returned an error.

But I continue to refresh the firewall logs. The communication was occurring about every minute. Lo and behold, I see the older drops, and then accepts for the last few minutes! I think it worked. I tell them to check.

They check their end. Sure enough. Communication beginning to work…

The customer tries to make assertion that this was a firewall problem all along. Not so fast. Firewall guy says, well, the firewall is doing exactly what it’s supposed to be doing. who’s right?

We’re all good for now, but we state this is a kludge for today and a follow-up meeting needs to occur.

So what happened?
I think the single most important thing is that the firewall guy switched his problem hypothesis from Must be asymmetric routing, to Maybe it’s a badly behaved application. Meaning what? What if you have an application that establishes a TCP connection, and then to beat idle timeouts, sends a KEEP ALIVE packet every minute? Well, now, suppose your firewall is rebooted in the middle of that because it has changed management stations and needs to reload policy? What might the situation look like to it?

It you were unlucky, it just might see these KEEP ALIVE TCP packets without having the connection in its connection table, in other words, exactly the situation we are observing!

What should have happened?
It would have been great if the communication were forced to be re-established form time-to-time, even once a day. This problem had been going on for days.

But, given this very stupid behaviour on the part of this application, if the app people had been aware they should have forced their application to re-establish the TCP connection after the firewall reboot. Probably, for the one that did work, it had been forced to re-establish.

A firewall person has to be sufficiently aware to realize this could be happening, and advise the app owner on what to do to prevent it.

Conclusion
So whose problem is it?

To the app people it looks like a firewall issue, cut-and-dried. To a firewall guy it looks like an application issue, cut-and-dried. I see both sides. It is some of both. An app owner has to understand enough about firewalls to see that this type of thing can occur. Assigning blame to one side or the other, as most people are wont to do, is not productive. Only a team effort could have revealed this issue. And recall that the “fix” is actually a kludge that lowers security.

Case: almost closed.

References and related
Checkpoint’s note on TCP packet out of state first packet isn’t SYN: https://community.checkpoint.com/t5/General-Topics/TCP-packet-out-of-state-First-packet-isn-t-SYN-tcp-flags-SYN-ACK/td-p/37166

The IT Detective agency cases are still coming fast and furious. Here’s another recent case. Failed to convert character