Admin Firewall Proxy

Checkpoint SYN Defender: what you don’t know can hurt you


Our EDI group hails me last Friday and says they can’t reach their VANs, or at best intermittently. What to do, what to do… I go on the offensive and say they have to stop using FTP (and that’s literal FTP, not sftp, not FTPs, just plain old FTP), it’s been out of date for at least 15 years.

But that wasn’t really helping the situation, so I had to dig a lot deeper. And frankly, I was coincidentally having intermittent issues with my scripted speedtests. Could the two be related?

The details

To be continued, hopefully.

References and related

VAN: Value Added Network

Admin Network Technologies

The IT Dective Agency: someone stole my switch port


A complex environment produces some too-strange-to-be-true type of issues. Yesterday was one of those days. Let me try to set this up like a script from a play.

The setting

A non-descript server room somewhere in the greater New York City area.

The equipment

A generic security appliance we’ll call ThousandEyes PX, just to make up a name.

Cisco Nexus 7K plus a FEX

The players

Dr John – the protagonist

PCT – a generic network vendor

Florence Ranjard – an admin of ThousandEyes PX in France

Shake Abel – a server room resource in PA

Cloud Johnson – someone in Request Management

Bill Otto – a network guy at heart, forced to deal with his now vendor-managed network via ITIL

The processes

ITIL – look it up

Scene 1

An email from Dr John….

Hi Bill,


Well that’s messed up, as they say. I wouldn’t believe it if I hadn’t seen it for myself. Someone, “stole” our port and assigned it to a different device on a different vlan – despite the fact that it was in active use!

I guess I will try to “steal” it back, assuming I can find the IT Catalog article, or maybe with the help of Cloud.

Fortunately I have console access to the Fireeye. I artificially introduced traffic, which I see reflected in the port statistics. So I know the ThousandEyes is still connected to this port, despite the wrong vlan and description.


Dr John

Scene 2

One Week earlier

Siting at home due to Covid, Florence realizes she can no longer access the management port of her group’s ThousandEyes security appliance located in another continent. She beings to investigate and even contacts the vendor…

Scene 3

This exciting script is to be contiued, hopefully

Admin Linux Network Technologies

Speedtest automation: what they don’t tell you


I began to implement the autmoation of speedtest checks. I was running the jobs every 10 minutes, but we noticed something flaky in the results. On the hour and on the half hour the tests seemed to be garbage. What’s going on?

Our findings

Well, if you use a scheduler to run a speedtest every 10 minutes, it will start exactly on the hour and exactly on the half-hour, amongst other start times. We were running it eight times to test eight different paths. Only the last two were returning reliable results. The early ones were throwing errors. So I introduced an offset to run the jobs at 2,12,22,32,42,52 minutes. And with this offset, the results became much more reliable.

The inevitable conclusion is that too many other people are running tests exactly on the hour and half hour. A single run takes roughly 30 seconds to complete. And it must be that the servers which speedtest rely on are simply overwhelmed and refuse to do more tests.

References and related

There is a linux script written in python that implements the full speedtest. It really works, which is cool.

But as well there is a RPM package you can get from the speedtest site itself.

Admin Network Technologies

What is the one DHCP problem managed network providers never recognize?

Answer: the one where their switch eats the DHCPDISCOVER packets. And the zmzaing thing is they never learn. And the second amazing thing is that they actually don’t apply the most basic networing debugging techniques when such a problem occurs. I’m talking your basic, DHCPDISCOVER packet goes to yuor switch, same DHCPDISCOVER packet never arrives to the DHCP server on same switch. We know it to be the case, but, to help convince yourself that your switch is eating the pakcets, do networky things like create a span port of the DHCP server’s port to prove to yourself that no DHCP requests are coming in. And yet, they are never prepared to do that, to propose that. So instead indirect proxies are used to draw the conclusion.

I’ve been involved in three of four such debugging sessions. They take hours. I took notes when it happened again this weekend. I guess that setup is pretty typical of how it plays out. A data center was moved, including a DHCP server. The new data center has a MAN network to the old one. All IPs were preserved. When they turned on the moved DHCP server DHCP lease were no longer getting handed out. In fact it was worse than that. With the moved DHCP sever turned off, most DHCP leases were working. But with it on, that’s when things really began to go south!

Here’s the switch port they noted for the iDRAC:

sh run int gi1/0/24
Building configuration…
Current configuration : 233 bytes
interface GigabitEthernet1/0/24
description --- To-cnshis01-iDRAC - iDRAC
switchport access vlan 202
switchport mode access
logging event link-status
speed 100
duplex full
spanning-tree portfast
ip dhcp snooping trust

The first line of course if the IOS command. OK, so they had that on the iDRAC, right. But on the actual server port they had this:

sh run int gi1/0/23
Building configuration…
Current configuration : 258 bytes
interface GigabitEthernet1/0/23
description --- To-cnshis01-Gb1 - Gb1
switchport access vlan 202
switchport mode access
logging event link-status
spanning-tree portfast
service-policy input PMAP_COS_REMARK_IN
service-policy output PMAP_COS_OUT

I basically told them cheekily up front that this is usually a network switch problem and that they have to play with the DHCP snooping enable setting.

And I have to say that the usual hours of debugging were short-circuited this time as they seemed to believe me, and simply experimented by adding

ip dhcp snooping trust

to the DHCP server’s main port. We immediately began seeing DHCPDISCOVER pakcets come in to the DHCP server, and the team testified that people were getting leases.

Final mystery explained

Now why were things behaving really badly – no leases – when the DHCP server was up but no DHCPDISCOVER requests were getting to it? I have the explanation for that as well. You see theer is a standby DHCP server which is designed for failure of the primary DHCP server. But not for this type of failure! That’s right. There is an out-of-band (by that I mean not carried over DHCP ports like UDP port 67) communication between standby and primary which tells the standby Hey, although you got this DHCPDISCOVER request, ignore it becasue the primary is active and will serve it! And meanwhile, as we have said, the primary wasn’t getting the requests at all. Upshot: no one gets leases.

Just to mention it

My second-to-last debugging session of this sort was a little different. There they mentioned that there was a “global setting” which governed this DHCP snooping on the switch. So they had to do something with that (enable or disable or something). So there was no issue with the individual switch ports. For me that’s just a variation on the same theme.

What’s the idea behind this feature?

Having done a total of zero minutes of research on the topic, I will anyway weigh in with my opinion! Suppose someone comes along and plugs in a consumer grade home router into your network. It’s probably going to act as a rogue DHCP server. Imagine the fun trying to debug that situation? We’ve all been there… These rogue devices are probably fairly common. So if your corporate switch doesn’t suppress certain DHCP packets from ports where they are not expected, then this rogue device will begin to take down your subnet and totally bewilder everyone. I imagine this setting that is the topic of this blog post stems from trying to suppress all unknown DHCP packets in advance. Its just that sometimes the setting is taken too far and, e.g., a firewall which relays DHCP requests is also getting its DHCP packets suppressed.


I normally would have presented this as part of my IT Detective series. But I feel this is more like a lament about the sad state of affairs with our network providers. And though I’ve seen this issue about four times in the past 12 months, they always act like they have no idea what we’re talking about. They’ve never encountered this problem. They have no idea how to fix it. And they have no idea how to further debug it.. What steps does the customer wish?

References and related

Juat because I mentioned it, here’s on of those IT Detective Agency blog posts: The IT Detecive Agency: web site not accessible

Admin Web Site Technologies

TCL iRule program with comments for F5 BigIP


A publicity-adverse colleague of mine wrote this amazing program. I wanted to publish it not so much for what it specifically does, but as well for the programming techniques it uses. I personally find i relatively hard to look up concepts when using TCL for an F5 iRule.

Program Introduction


# RULE_INIT is executed once every time the iRule is saved or on reboot. So it is ideal for persistent data that is shared accross all sessions.
# In our case it is used to define a template with some variables that are later substituted

when RULE_INIT {
# "static" variables in iRules are global and read only. Unlike regular TCL global variables they are CMP-friendly, that means they don't break the F5 clustered multi-processing mechanism. They exist in memory once per CMP instance. Unlike regular variables that exist once per session / iRule execution. Read more about it here:
# One thing to be careful about is not to define the same static variable twice in multiple iRules. As they are global, the last iRule saved overwrites any previous values.
# Originally the idea was to load an iFile here. That's also the main reason to even use RULE_INIT and static variables. The reasoning was (and I don't even know if this is true), that loading the iFile into memory once would have to be more efficient than to do it every time the iRule is executed. However, it is entirely possible that F5 already optimized iFiles in a way that loads them into memory automatically at opportune times, so this might be completely unnecessary.
# Either way, as you can tell, in the end I didn't even use iFiles. The reason for that is simply visibility. iFiles can't be easily viewed from the web UI, so it would be quite inconvenient to work with.
# The template idea and the RULE_INIT event stayed, even though it doesn't really serve a purpose, except maybe visually separating the templates from the rest of the code.
# As for the actual content of the variable: First thing to note is the use of  {} to escape the entire string. Works perfectly, even though the string itself contains braces. TCL magic.
# The rest is just the actual PAC file, with strategically placed TCL variables in the form of $name (this becomes important later)

            set static::pacfiletemplate {function FindProxyForURL(url, host)
            var globalbypass = "$globalbypass";
            var localbypass = "$localbypass";
            var ceglobalbypass = "$ceglobalbypass";
            var zpaglobalbypass = "$zpaglobalbypass";
            var zscalerbypassexception = "$zscalerbypassexception";

            var bypass = globalbypass.split(";").concat(localbypass.split(";"));
            var cebypass = ceglobalbypass.split(";");
            var zscalerbypass = zpaglobalbypass.split(";");
            var zpaexception = zscalerbypassexception.split(";");

            if(isPlainHostName(host)) {
                        return "DIRECT";

            for (var i = 0; i < zpaexception.length; ++i){
                        if (shExpMatch(host, zpaexception[i])) {
                                   return "PROXY $clientproxy";

            for (var i = 0; i < zscalerbypass.length; ++i){
                        if (shExpMatch(host, zscalerbypass[i])) {
                                   return "DIRECT";

            for (var i = 0; i < bypass.length; ++i){
                        if (shExpMatch(host, bypass[i])) {
                                   return "DIRECT";

            for (var i = 0; i < cebypass.length; ++i) {
                        if (shExpMatch(host, cebypass[i])) {
                                   return "PROXY $ceproxy";

            return "PROXY $clientproxy";

            set static::forwardingpactemplate {function FindProxyForURL(url, host)
            var forwardinglist = "$forwardinglist";
            var forwarding = forwardinglist.split(";");

            for (var i = 0; i < forwarding.length; ++i){
                        if (shExpMatch(host, forwarding[i])) {
                                   return "PROXY $clientproxy";

            return "DIRECT";

# Now for the actual code (executed every time a user accesses the vserver)
    # The request URI can of course be used to differentiate between multiple PAC files or to restrict access.
    # So can basically any other request attribute. Client IP, host, etc.
            if {[HTTP::uri] eq "/proxy.pac"} {

                        # Here we set variables with the exact same name as used in the template above.
                        # In our case the values come from a data group, but of course they could also be defined
                        # directly in this iRule. Using data groups makes the code a bit more compact and it
                        # limits the amount of times anyone needs to edit the iRule (potentially making a mistake)
                        # for simple changes like adding a host to the bypass list
                        # These variables are all set unconditionally. Of course it is possible to set them based
                        # on for example client IP (to give different bypass lists or proxy entries to different groups of users)
                        set globalbypass [ class lookup globalbypass ProxyBypassLists ]
                        set localbypass [ class lookup localbypassEU ProxyBypassLists ]
                        set ceglobalbypass [ class lookup ceglobalbypass ProxyBypassLists ]
                        set zpaglobalbypass [ class lookup zpaglobalbypass ProxyBypassLists ]
                        set zscalerbypassexception [ class lookup zscalerbypassexception ProxyBypassLists ]
                        set ceproxy [ class lookup ceproxyEU ProxyHosts ]

                        # Here's a bit of conditionals, setting the proxy variable based on which virtual server the
                        # iRule is currently executed from (makes sense only if the same iRule is attached to multiple
                        # vservers of course)
                        if {[virtual name] eq "/Common/proxy_pac_http_90_vserver"} {
                            set clientproxy [ class lookup formauthproxyEU ProxyHosts ]
                        } elseif {[virtual name] eq "/Common/testproxy_pac_http_81_vserver"} {
                            set clientproxy [ class lookup testproxyEU ProxyHosts]
                        } elseif {[virtual name] eq "/Common/proxy_pac_http_O365_vserver"} {
                            set clientproxy [ class lookup ceproxyEU ProxyHosts]
                        } else {
                            set clientproxy [ class lookup clientproxyEU ProxyHosts ]

                        # Now this is the actual magic. As noted above we have now set TCL variables named for example
                        # $globalbypass and our template includes the string "$globalbypass"

                        # What we want to do next is substitute the variable name in the template with the variable values
                        # from the code.
                        # "subst" does exactly that. It performs one level of TCL execution. Think of "eval" in basically
                        # any language. It takes a string and executes it as code.
                        # Except for "subst" there are two in this context very useful parameters: -nocommands and -nobackslashes.
                        # Those prevent it from executing commands (like if there was a ping or rm or ssh or find or anything
                        # in the string being subst'd it wouldn't actually try to execute those commands) and from normalizing
                        # backslashes (we don't have any in our PAC file, but if we did, it would still work).
                        # So what is left that it DOES do? Substituting variables! Exactly what we want and nothing else.
                        # Now since the static variable is read only, we can't do this substitution on the template itself.
                        # And if we could it wouldn't be a good idea, because it is shared accross all sessions. So assuming
                        # there are multiple versions of the PAC file with different proxies or bypass lists, we would
                        # constantly overwrite them with each other.
                        # The solution is simply to save the output of the subst in a new local variable that exists in
                        # session context only.
                        # So from a memory point of view the static/global template doesn't really gain us anything.
                        # In the end we have the template in memory once per CMP and then a substituted copy of the template
                        # once per session. So as noted earlier, could've probably just removed the entire RULE_INIT block,
                        # set the template in session context (HTTP_REQUEST event) and get the same result,
                        # maybe even slightly more efficient.
                        set pacfile [subst -nocommands -nobackslashes $static::pacfiletemplate]

                        # All that's left to do is actually respond to the client. Simple stuff.
                        HTTP::respond 200 content $pacfile "Content-Type" "application/x-ns-proxy-autoconfig" "Cache-Control" "private,no-cache,no-store,max-age=0"
            # In this example we have two different PAC files with different templates on different URLs
            # Other iRules we use have more differentiation based on client IP. In theory we could have one big iRule
            # with all the PAC files in the world and it would still scale very well (just a few more if/else or switch cases)
            } elseif { [HTTP::uri] eq "/forwarding.pac" } {
                set clientproxy [ class lookup clientproxyEU ProxyHosts]
                set forwardinglist [ class lookup forwardinglist ProxyBypassLists ]
            set forwardingpac [subst -nocommands -nobackslashes $static::forwardingpactemplate]
            HTTP::respond 200 content $forwardingpac "Content-Type" "application/x-ns-proxy-autoconfig" "Cache-Control" "private,no-cache,no-store,max-age=0"
            } else {
                # If someone tries to access a different path, give them a 404 and the right URL
                HTTP::respond 404 content "Please try" "Content-Type" "text/plain" "Cache-Control" "private,no-cache,no-store,max-age=0"

To be continued...

Admin Web Site Technologies

Building a regular (non-bloggy) web site with WordPress


I recently was a first-hand witness to the building of a couple web sites. I was impressed as the webmaster turned them into “regular” web sites – some bit of marketing, some practical functionality – and removed all the traditional blog components. Here are some of the ingredients.

The ingredients

Background images and logo – a place to look for quality, non-copyrighted images on a variety of topics. These can serve as a background image to the home page for instance. – a place to do your logo design.



Security Plugins

WPS Hide Login

Layout Plugins


Envato Elements

Form Plugins

Contact Form 7

Contact Form 7 Captcha

Ninja Forms. Note that Ninja Forms 3 includes Google’s reCAPTCHA, so no need to get that as a separate plugin. I am trying to work with Ninja Forms for my contact form.

Infrastructure Plugins

WP Mail SMTP – my WordPress server needs this but your mileage may vary.

How-to videos

I don’t have this link yet.

Reference and related

To sign up for an API key for Google’s reCAPTCHA, go here:


OpenSCAD export to STL does nothing

Quick Tip
If you are using OpenSCAD for your 3D model construction, and after creating a satisfactory model do an export to STL, you may observe that nothing at all happens!

I was stuck on this problem for awhile. Yes, the solution is obvious for a regular users, but I only use it every few months. If you open the console you will see the problem immediately:

ERROR: Nothing to export! Try rendering first (press F6).

But in my case I had closed the console, forgot there was such a thing, and of course it remembers your settings.

So you have to render your object (F6) before you can export as an STL file.

References and related
I don’t know why this endplate design blog post which I wrote never caught on. I think the pictures are cool.

OpenSCAD is a 3D modelling application that uses CSG – constructive Solid Geometry. It’s very math and basic geometric shapes focussed – perfect for me.

Admin Linux

vsftd Virtual Users stopped working after patching: the solution

vsftpd is a useful daemon which I use to run an ftps service (ftp which uses TLS encryption). Since I am not part of the group that administers the server, it makes sense for me to maintain my own userlist rather than rely on the system password database. vsftpd has a convenient feature which allows this known as virtual users.

More details
In /etc/pam.d/vsftpd.virtual I have:

auth required db=/etc/vsftpd/vsftpd-virtual-user
account required db=/etc/vsftpd/vsftpd-virtual-user
session required

In the file /etc/vsftpd-virtual-user.db I have my Berkeley database of users and passwords. See references on how to set this up.

The point is that I had this all working last year – 2019 – on my SLES 12SP4 server.

Then it all broke
Then in early May, 2020, all the FTPs stopped working. The status of the vsftpd service hinted that the file /lib64/security/ could not be loaded. Sure enough, it was missing! I checked some of my other SLES12SP4 servers, some of which are on a different patch schedule. It was missing on some, and present on one. So I “borrowed” from the one server which still had it and put it onto my server in /lib64/security. All good. Service restored. But clearly that is a hack.

What’s going on
So I asked a Linux expert what’s going on and got a good explanation.

pam_userdb has been moved to a separate package, named pam-extra
Advisory ID: SUSE-RU-2020:917-1
Released: Fri Apr 3 15:02:25 2020
Summary: Recommended update for pam
Type: recommended
Severity: moderate
References: 1166510
This update for pam fixes the following issues:
- Moved pam_userdb into a separate package pam-extra. (bsc#1166510)
Installing the package pam-extra should resolve your issue.

I installed the pam-extra package using zypper, and, yes, it creates a /lib64/security/ file!

And vsftpd works once more using supported packages.

Virtual users with vsftpd requires However, PAM wished to decouple itself from dependency on external databases, etc, so they bundled this kind of thing into a separate package, pam-extra, more-or-less in the middle of a patch cycle. So if you had the problem I had, the solution may be as simple as installing the pam-extra package on your system. Although I experienced this on SLES, I believe it has or will happen on other Linux flavors as well.

This problem is poorly documented on the Internet.

References and related

Admin Linux Network Technologies

Configure rsyslog to send syslog to SIEM server running TLS

You have to dig a little to find out about this somewhat obscure topic. You want to send syslog output, e.g., from the named daemon, to a syslog server with beefed up security, such that it requires the use of TLS so traffic is encrypted. This is how I did that.

The details
This is what worked for me:

# DrJ fixes - log local0 to DrJ's dmz syslog server - DrJ 5/6/20
# use local0 for named's query log, but also log locally
# see
# @@ means use TCP
$DefaultNetstreamDriver gtls
$DefaultNetstreamDriverCAFile /etc/ssl/certs/GlobalSign_Root_CA_-_R3.pem
$ActionSendStreamDriver gtls
$ActionSendStreamDriverMode 1
$ActionSendStreamDriverAuthMode anon
local0.*                                @@(o)
#local0.*                               /var/lib/named/query.log
local1.*                                -/var/log/localmessages
#local0.*;local1.*                      -/var/log/localmessages
local2.*;local3.*                       -/var/log/localmessages
local4.*;local5.*                       -/var/log/localmessages
local6.*;local7.*                       -/var/log/localmessages

The above is the important part of my /etc/rsyslog.conf file. The SIEM server is running at IP address on TCP port 6514. It is using a certificate issued by Globalsign. An openssl call confirms this (see references).

Other gothcas
I am running on a SLES 15 server. Although it had rsyslog installed, it did not support tls initially. I was getting a dlopen error. So I figured out I needed to install this module:


References and related
How to find the server’s certificate using openssl. Very briefly, use the openssl s_client submenu.

The rsyslog site itself probably has the complete documentation. Though I haven’t looked at it thoroughly, it seems very complete.

Admin JavaScript Network Technologies

Practical Zabbix examples

I share some Zabbix items I’ve had to create which I find useful.

Convert DateAndTime SNMP output to human-readable format

Of course this is not very Zabbix-specific, as long as you realize that Zabbix produces the outer skin of the function:

function (value) {
// DrJ 2020-05-04
// see for SNMP DateAndTime format
'use strict';
//var str = "07 E4 05 04 0C 32 0F 00 2B 00 00";
var str = value;
// alert("str: " + str);
// read values are hex
var y256 = str.slice(0,2); var y = str.slice(3,5); var m = str.slice(6,8); 
var d = str.slice(9,11); var h = str.slice(12,14); var min = str.slice(15,17);
// convert to decimal
var y256Base10 = +("0x" + y256);
// convert to decimal
var yBase10 = +("0x" + y);
var Year = 256*y256Base10 + yBase10;
//  alert("Year: " + Year);
var mBase10 = +("0x" + m);
var dBase10 = +("0x" + d);
var hBase10 = +("0x" + h);
var minBase10 = +("0x" + min);
var YR = String(Year); var MM = String(mBase10); var DD = String(dBase10);
var HH = String(hBase10);
var MIN = String(minBase10);
// padding
if (mBase10 &lt; 10)  MM = "0" + MM; if (dBase10 &lt; 10) DD = "0" + DD;
if (hBase10 &lt; 10) HH = "0" + HH; if (minBase10 &lt; 10) MIN = "0" + MIN;
var Date = YR + "-" + MM + "-" + DD + " " + HH + ":" + MIN;
return Date;

I put that javascript into the preprocessing step of a dependent item, of course.

All my real-life examples do not fill in the last two fields: +/-, UTC offset. So in my case the times must be local times. But consequently I have no idea how a + or – would be represented in HEX! So I just ignored those last fields in the SNNMP DateAndTime which otherwise might have been useful.

Here’s an alternative version which calculates how long its been in hours since the last AV signature update.

// DrJ 2020-05-05
// see for SNMP DateAndTime format
'use strict';
//var str = "07 E4 05 04 0C 32 0F 00 2B 00 00";
var Start = new Date();
var str = value;
// alert("str: " + str);
// read values are hex
var y256 = str.slice(0,2); var y = str.slice(3,5); var m = str.slice(6,8); var d = str.slice(9,11); var h = str.slice(12,14); var min = str.slice(15,17);
// convert to decimal
var y256Base10 = +("0x" + y256);
// convert to decimal
var yBase10 = +("0x" + y);
var Year = 256*y256Base10 + yBase10;
//  alert("Year: " + Year);
var mBase10 = +("0x" + m);
var dBase10 = +("0x" + d);
var hBase10 = +("0x" + h);
var minBase10 = +("0x" + min);
var YR = String(Year); var MM = String(mBase10); var DD = String(dBase10);
var HH = String(hBase10);
var MIN = String(minBase10);
var Sigdate = new Date(Year, mBase10 - 1, dBase10,hBase10,minBase10);
//difference in hours
var difference = Math.trunc((Start - Sigdate)/1000/3600);
return difference;

Calculated bandwidth from an interface that only provides byte count
Again in this example the assumption is you have an item, probably from SNMP, that lists the total inbound/outbound byte count of a network interface – hopefully stored as a 64-bit number to avoid frequent rollovers. But the quantity that really excites you is bandwidth, such as megabits per second.

Use a calculated item as in this example for Bluecoat ProxySG:


Give it type numeric, Units of mbps. sgProxyInBytesCount is the key for an SNMP monitor that uses OID


where {$INTERFACE_TO_MEASURE} is a macro set for each proxy with the SNMP-reported interface number that we want to pull the statistics for.

The 300 in the denominator of the calculated item is required for me because my item is run every five minutes.

No one really cares about the actual total value of byte count, right? So just re-purpose the In Bytes Count item a bit as follows:

  • add preprocessing step: Change per second
  • add second preprocessing step, Custom multiplier 8e-6

The first step gives you units of bytes/second which is less interesting than mbps, which is given by the second step. So the final units are mbps.

Be sure to put the units as !mbps into the Zabbix item, or else you may wind up with funny things like Kmbps in your graphs!

Creating a baseline

Even as of Zabbix v 5, there is no built-in baseline item type, which kind of sucks. Baseline can mean many different things to many people – it really depends on the data. In the corporate world, where I’m looking at bandwidth, my data has these distinct characteristics:

  • varies by hour-of-day, e.g., mornings see heavier usage than afternoons
  • there is the “Friday effect” where somewhat less usage is seen on Fridays, and extremely less usage occurs on weekends, hence variability by day-of-week
  • probably varies by day of month, e.g., month-end closings

So for this type of data (except the last criterion) I have created an appropriate baseline. Note I would do something different if I were graphing something like the solar generation from my solar panels, where the day-of-week variability does not exist.

Getting to the point, I have created a rolling lookback item. This needs to be created as a Zabbix Item of type Calculated. The formula is as follows:


In this example sgProxyInBytesCount is my key from the reference item. Breaking it down, it does a rolling lookback of the last six measurements taken at this time of day on this day of the week over the last six weeks and averages them. Voila, baseline! The more weeks you include the more likely you are to include data you’d rather not like holidays, days when things were busted, etc. I’d like to have a baseline that is from a fixed time, like “all of last year.” I have no idea how. I actually don’t think it’s possible.

But, anyway, the baseline approach above should generally work for any numeric item.


The above approach only gives you six measurements, hence 1/sqrt(6) ~ 40% standard deviation by the law of large numbers, which is still pretty jittery as it turns out. So I came up with this refined approach which includes 72 measurements, hence 1/sqrt(72) ~ 12% st dev. I find that to be closer to what you intuitively expect in a baseline – a smooth approximation of the past. Here is the refined function:


I would have preferred a one-hour interval centered around one week ago, etc., e.g., something like 1w+30m, but such date arithmetic does not seem possible in Zabbix functions. And, yeah, I could put 84600s (i.e., 86400 – 1800), but that is much less meaingful and so harder to maintain. Here is a three-hour graph whose first half still reflects the original (jittery) baseline, and last half the refined function.

Latter part has smoothed baseline in light green

What I do not have mastered is whether we can easily use a proper smoothing function. It does not seem to be a built-in offering of Zabbix. Perhaps it could be faked by a combination of pre-processing and Javascript? I simply don’t know, and it’s more than I wish to tackle for the moment.

Data gap between mulitple item measurements looks terrible in Dashboard graph – solution

In a Dashboard if you are graphing items which were not all measured at the same time, the results can be frustrating. For instance, an item and its baseline as calculated above. The central part of the graph will look fine, but at either end giant sections will be missing when the timescale of display is 30 minutes or 60 minutes for items measured every five minutes or so. Here’s an example before I got it totally fixed.

Zabbix item timing mismatch

See the left side – how it’s broken up? I had beguin my fix so the right side is OK.

The data gap solution

Use Scheduling Intervals in defining the items. Say you want a measurement every five minutes. Then make your scheduling interval m/5 in all the items you are putting on the same graph. For good measure, make the regular interval value infrequent. I use a macro {$UPDATE_LONG}. What this does is force Zabbix to measure all the items at the same time, in this case every five minutes on minutes divisible by five. Once I did that my incoming bandwith item and its corresponding baseline item aligned nicely.

Low-level Discovery

I cottoned on to the utility of this part of Zabbix a little late. Hey, slow learner, but I eventually got there. What I found in my F5 devices is that using SNMP to monitor the /var filesystem was a snap: it was always device 32 (final OID digit). But /var/log monitoring? Not so much. Every device seemed different, with no obvious pattern. Active and standby units – identical hardware – and some would be 53, the partner 55. Then I rebooted a device and its number changed! So, clearly, dynamically assigned and no way was I going to keep up with it. I had learned the numbers by doing an snmpwalk. The solution to this dynamically changing OID number is to use low-level discovery.

A couple of really useful but poorly documented items are shared. Perhaps more will be added in the future.

References and related for SNMP DateAndTime format

My first Zabbix post was mostly documentation of a series of disasters and unfinished business.

Blog post about calculated items by a true expert:

Low-level Discovery write-up: