
Everything I need to know about Influxdb, Grafana and Flux

Intro

Some of my colleagues had used Influxdb and Grafana at their previous job so they thought it might be a good fit for what we’re doing in the Visibility team. It sounded good in theory, anyway, so I had to agree. There were a lot of pitfalls. Eventually I got it to the point where I’m satisfied with my accomplishments and want to document the hurdles I’ve overcome.

So as time permits I will be fleshing this out.

Grafana

I’m going to lead with the picture and then the explanation makes a lot more sense.

I’ve spent the bulk of my time wrestling with Grafana. Actually it looks like a pretty capable tool. It’s mostly just understanding how to make it do what you are dreaming about. Our installed version currently is 9.2.1.

My goal is to make a heatmap. But a special kind similar to what I saw the network provider has. That would namely entail one vedge per row, and one column per hour, hence, 24 columns in total. A vedge is a kind of SD-WAN router. I want to help the networking group look at hundreds of them at a time. So that’s one potential dashboard. It would give a view of a day. Another dashboard would show just one router with each row representing a day, and the columns again showing an hour. Also a heatmap. The multi-vedge dashboard should link to the individual dashboard, ideally. In the end I pulled it off. I am also responsible for feeding the raw data into Influxdb and hence also for the table design.

Getting a workable table design was really important. I tried to design it in a vacuum, but that only partially worked. So I revised, adding tags and fields as I felt I needed to, while being mindful of not blowing up the cardinality. I am now using these two tables, sorry, measurements.

vedge measurement
vedge_stats measurement

Although there are hundreds of vedges, some of my tags are redundant, so don’t get overly worried about my high cardinality. UTChour is, yes, a total kludge – not the “right” way to do things. But I’m still learning and it was simpler in my mind. item in the first measurement is redundant with itemid. But it is more user-friendly: a human-readable name.
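Why redundant tags are harmless for cardinality: series cardinality is driven by the number of distinct tag combinations, not by the number of tag keys. Here is a small Python sketch of that idea. The points are made up for illustration (only the first item name and itemid resemble real data), and series_cardinality is a hypothetical helper, not anything InfluxDB provides.

```python
# Hypothetical sample points: two interfaces on one vedge, a few hours of data.
points = [
    {"hostname": "NAUSNEWTO0057_vEdge1", "itemid": "683415",
     "item": "NAUSNEWTO0057_vEdge1_MPLS_ge0/1.4000_ingress", "UTChour": "09"},
    {"hostname": "NAUSNEWTO0057_vEdge1", "itemid": "683415",
     "item": "NAUSNEWTO0057_vEdge1_MPLS_ge0/1.4000_ingress", "UTChour": "10"},
    {"hostname": "NAUSNEWTO0057_vEdge1", "itemid": "683416",
     "item": "NAUSNEWTO0057_vEdge1_MPLS_ge0/1.4000_egress", "UTChour": "09"},
]

def series_cardinality(points, tag_keys):
    """Count the distinct combinations of the given tag keys."""
    return len({tuple(p[k] for k in tag_keys) for p in points})

by_itemid = series_cardinality(points, ["itemid"])
with_redundant = series_cardinality(points, ["hostname", "itemid", "item"])
with_hour = series_cardinality(points, ["hostname", "itemid", "item", "UTChour"])
print(by_itemid, with_redundant, with_hour)
```

Adding item and hostname on top of itemid does not create new combinations, because they map one-to-one. UTChour is the tag that really multiplies things, by up to 24.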

Influx Query Explorer

It’s very helpful to use the Explorer, but the syntax there is not exactly the same as it will be when you define template variables. Go figure.

Multiple vedges for the last day

So how did I do it in the end?

Mastering template variables is really key here. I have a drop-down selection for region. In Grafana-world it is a custom variable with potential values EU,NA,SA,AP. That’s pretty easy. I also have a threshold variable, with possible values: 0,20,40,60,80,90,95. And a math variable with values n95,avg,max. More recently I’ve added a threshold_max and a math_days variable.

It gets more interesting however, I promise. I have a category variable which is of type query:

from(bucket: "poc_bucket2")
|> range (start: -1d)
|> filter(fn:(r) => r._measurement == "vedge_stats")
|> group()
|> distinct(column: "category")

The distinct function eliminates rows with identical values. This can be useful for creating an iterator!
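As a rough Python analogy of what group() followed by distinct(column: "category") hands to Grafana for the drop-down: the rows below are made up, and the first-seen ordering is an assumption of this sketch, not something verified about Flux.

```python
# Hypothetical rows standing in for what the range and filter steps return.
rows = [
    {"item": "vedgeA_MPLS_ingress", "category": "Gold"},
    {"item": "vedgeB_MPLS_ingress", "category": "Silver"},
    {"item": "vedgeC_Internet_egress", "category": "Gold"},
    {"item": "vedgeD_MPLS_egress", "category": "Bronze"},
]

def distinct(rows, column):
    """Return each value of `column` once, keeping first-seen order."""
    return list(dict.fromkeys(row[column] for row in rows))

categories = distinct(rows, "category")
print(categories)
```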

Multi-value and Include all options are checked. Just to make it meaningful, category is assigned by the WAN provider and has values such as Gold, Silver, Bronze.

And it gets still more interesting because the last variable depends on the earlier ones, hence we are using chained variables. The last variable, item, is defined thusly:

from(bucket: "poc_bucket2")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stats" and r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.category, set: ${category:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold} and r._value <= ${threshold_max})
|> group()
|> distinct(column: "item")

So what it is designed to do is to generate a list of all the items, which in reality are particular interfaces of the vedges, into a drop-down list.

Note that I want the user to be able to select multiple categories. It’s not well-documented how to chain such a variable, so note the use of contains and set in that one filter function.
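To see why the :json format pairs so well with contains, it helps to look at what the query turns into after Grafana substitutes the variable. The Python sketch below mimics that substitution, assuming (as the Grafana docs describe) that ${category:json} renders a multi-value selection as a JSON array. The function names and the selected values are hypothetical.

```python
import json

def render_json_variable(selected):
    """Approximate ${category:json}: the selection as a compact JSON array."""
    return json.dumps(selected, separators=(",", ":"))

def contains(value, set_):
    """Python stand-in for Flux's contains(value:, set:)."""
    return value in set_

selected = ["Gold", "Silver"]  # hypothetical multi-value selection
template = 'contains(value: r.category, set: ${category:json})'
rendered = template.replace("${category:json}", render_json_variable(selected))
print(rendered)
print(contains("Gold", selected), contains("Bronze", selected))
```

The rendered array happens to be a valid Flux array literal, which is why it can be dropped straight into set:.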

And note the double-quotes around ${Region}, another chained variable. You need those double-quotes! It kind of threw me because in Explorer I believe you may not need them.

And all that would be simply nice if we didn’t somehow incorporate these template variables into our panels. I use the Stat visualization. So you’ll get one stat per series. That’s why I artificially created a tag UTChour, so I could easily get a unique stat box for each hour.

The stat visualization flux Query

Here it is…

data = from(bucket: "poc_bucket2")
  |> range(start: -24h, stop: now())
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and r.hostname =~ /^${Region}/ and r.item == "${item}"
  )
  |> drop(columns: ["itemid","ltype","hostname"])
data

Note I have dropped my extra tags and such which I do not wish to appear during a mouseover.

Remember our regions can be one of AP,EU,NA or SA? Well the hostnames assigned to each vedge start with the two letters of its region of location. Hence the regular expression matching works there to restrict consideration to just the vedges in the selected region.
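The same prefix match is easy to try out in Python. This is only an illustration of the r.hostname =~ /^${Region}/ idea; apart from the NA one, the hostnames are invented.

```python
import re

def in_region(hostname, region):
    """True if the hostname starts with the two-letter region code."""
    return re.match("^" + re.escape(region), hostname) is not None

# Invented hostnames following the "region prefix" naming convention.
hostnames = ["EUDEBERLI0012_vEdge1", "NAUSNEWTO0057_vEdge1", "APSGSINGA0003_vEdge2"]
eu_only = [h for h in hostnames if in_region(h, "EU")]
print(eu_only)
```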

We are almost done.

Making it a heat map

So my measurement has a field called percent, which is the percent of available bandwidth that is being used. So I created color-based thresholds:

Colorful percent-based thresholds

You can imagine how colorful the dashboard gets as you ratchet up the threshold template variable. So the use of these thresholds is what turns our stat squares into a true heatmap.
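Conceptually, each stat box just looks up its color from the percent value. A minimal Python sketch of that lookup follows. The threshold steps and colors here are placeholders, not the actual panel settings, and the sketch assumes a value takes the color of the highest step it has reached.

```python
from bisect import bisect_right

# Placeholder steps and colors; the real ones are whatever you set in the panel.
STEPS = [50, 75, 90]
COLORS = ["green", "yellow", "orange", "red"]  # base color plus one per step

def color_for(percent):
    """Pick the color of the highest threshold step that percent has reached."""
    return COLORS[bisect_right(STEPS, percent)]

for p in (10, 50, 80, 97):
    print(p, color_for(p))
```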

Heatmap visualization

I found the actual heatmap visualization useless for my purposes, by the way!

There is also an unsupported heatmap plugin for Grafana which simply doesn’t work. Hence my roll-your-own approach.

Repetition

How do we get a panel row per vedge? The stat visualization has a feature called Repeat Options. So you repeat by variable. The variable selected is item. Remember that item came from our very last template variable. Repeat direction is Vertical.

For calculation I choose mean. Layout orientation is Vertical.

The visualization title is also variable-driven. It is ${item} .

The panels are long and thin. Like maybe two units high? – one unit for the label (the item) and the one below it for the 24 horizontal stat boxes.

Put it all together and voila, it works and it’s cool and interactive and fast!

Single vedge heatmap data over multiple days

Of course this is very similar to the multiple vedge dashboard. But now we’re drilling down into a single vedge to look at its usage over a period of time, such as the last two weeks.

Flux query

import "date"
b = date.add(d: 1d, to: -${day}d)
data = from(bucket: "poc_bucket2")
  |> range(start: -${day}d, stop: b)
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and
    r.item == "$item"
  )
  |> drop(columns:["itemid","ltype","hostname"])
data

Variables

As before we have a threshold, Region and category variable with category derived from the same flux query shown above. A new variable is day, which is custom and hidden. It has values 1,2,3,4,…,14. I don’t know how to do a loop in flux or I might have opted for a more elegant method to specify the last 14 days.

I did the item variable query a little differently, but I think it’s mostly an alternate and could have been the same:

from(bucket: "poc_bucket2")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stats" and r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.category, set: ${category:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold} and r._value <= ${threshold_max})
|> group()
|> distinct(column: "item")

Notice the slightly different handling of Region. And those double-quotes are important, as I learned from the school of hard knocks!

The flux query in the panel is of course different. It looks like this:

import "date"
b = date.add(d: 1d, to: -${day}d)
data = from(bucket: "poc_bucket2")
  |> range(start: -${day}d, stop: b)
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and
    r.item == "$item"
  )
  |> drop(columns:["itemid","ltype","hostname"])
data

So we’re doing some date arithmetic so we can get panel strips, one per day. These panels are long and thin, same as before, but I omitted the title since it’s all the same vedge.
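In Python terms, the start and stop passed to range() for each repeated panel work out as below. This is a sketch of the arithmetic only; day_window is a hypothetical name and the "now" timestamp is arbitrary.

```python
from datetime import datetime, timedelta, timezone

def day_window(now, day):
    """Mirror range(start: -${day}d, stop: date.add(d: 1d, to: -${day}d))."""
    start = now - timedelta(days=day)
    stop = start + timedelta(days=1)
    return start, stop

now = datetime(2023, 6, 20, 12, 0, tzinfo=timezone.utc)  # arbitrary example time
for day in (1, 2, 14):
    start, stop = day_window(now, day)
    print(day, start.isoformat(), stop.isoformat())
```

Each value of day yields a 24-hour slice, and the slices for 1 through 14 tile the last two weeks without overlapping.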

The repeat options are repeat by variable day, repeat direction Vertical as in the other dashboard. The visualization is Stat, as in the other dashboard.

And that’s about it! Here the idea is that you play with the independent variables such as Region and threshold, it generates a list of matching vedge interfaces and you pick one from the drop-down list.

Linking the multiple vedge dashboard to the single vedge history dashboard

Of course the more interactive you make these things the cooler it becomes, right? I was excited to be able to link these two dashboards together in a sensible way.

In the panel config you have Data links. I found this link works:

https://drjohns.com:3000/d/1MqpjD24k/single-vedge-usage-history?orgId=1&var-threshold=60&var-math=n95&${item:queryparam}

So to generalize since most of the URL is specific to my implementation, both dashboards utilize the item variable. I basically discovered the URL for a single vedge dashboard and dissected it and parameterized the item, getting the syntax right with a little Internet research.

So the net effect is that when you hover over any of the vedge panels in the multi-vedge dashboard, you can click on that vedge and pull up – in a new tab in my case – the individual vedge usage history. It’s pretty awesome.
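For anyone wondering what ${item:queryparam} actually expands to, here is an approximation in Python. It assumes the documented Grafana behavior of producing var-item=<value> with the value URL-encoded. The queryparam helper is hypothetical, and the item name is just a sample.

```python
from urllib.parse import quote

def queryparam(name, values):
    """Approximate ${name:queryparam}: var-name=value, once per selected value."""
    return "&".join(f"var-{name}={quote(v, safe='')}" for v in values)

base = ("https://drjohns.com:3000/d/1MqpjD24k/single-vedge-usage-history"
        "?orgId=1&var-threshold=60&var-math=n95")
item = "NAUSNEWTO0057_vEdge1_MPLS_ge0/1.4000_ingress"  # sample item name
link = base + "&" + queryparam("item", [item])
print(link)
```

Note how the slash in the interface name gets percent-encoded, which is one reason to let Grafana build this part of the URL rather than pasting ${item} in by hand.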

Passing the start and stop time range in the link

I didn’t need it for this one, but in another project I wanted the user to use time selection and then be able to get details where their time selection was preserved. So… I found that adding these additional variables to the link did the job:

&from=${__from}&to=${__to}

It’s a little ugly because the time no longer displays as last 24 hours, for instance. But oh well…
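The reason for the ugliness is that ${__from} and ${__to} are substituted as absolute times in epoch milliseconds rather than as a relative range. A quick Python sketch of decoding such values (the two numbers are made-up examples):

```python
from datetime import datetime, timezone

def from_grafana_ms(ms):
    """Convert an epoch-milliseconds value, as in ${__from}, to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

frm, to = 1687219200000, 1687305600000  # made-up example values
start, stop = from_grafana_ms(frm), from_grafana_ms(to)
print(start.isoformat(), stop.isoformat(), stop - start)
```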

Influxdb

Influxdb is a time series database. It takes some getting used to. Here is my cheat sheet which I like to refer to.

  • bucket is a named location with a retention policy where time-series data is stored.
  • series is a logical grouping of data defined by shared measurement, tag and field.
  • measurement is similar to an SQL database table.
  • tag is similar to indexed columns in an SQL database.
  • field is similar to unindexed columns in an SQL database.
  • point is similar to an SQL row.
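To tie those terms together, here is how one point looks when written out as InfluxDB line protocol: measurement, then tags, then fields, then the timestamp. The Python below is a deliberately simplified sketch. It does no escaping, treats every field as a float, and to_line_protocol is a hypothetical helper rather than part of any client library.

```python
def to_line_protocol(measurement, tags, fields, timestamp):
    """Very simplified line protocol: no escaping, float fields only."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp}"

line = to_line_protocol(
    "vedge",                                                   # measurement
    {"hostname": "NAUSNEWTO0057_vEdge1", "ltype": "MPLS"},     # tags (indexed)
    {"value": 829885.1, "percent": 8},                         # fields (not indexed)
    1671036337,                                                # time, in seconds here
)
print(line)
```

The timestamp above is in seconds, which only makes sense if you also tell InfluxDB to use second precision when writing.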

This is not going to make a lot of sense to anyone who isn’t Dr John. But I’m sure I’ll be referring to this section for a How I did this reminder.

OK. So I wrote a feed_influxdb.py script which runs every 12 minutes in an Azure DevOps pipeline. It extracts the relevant vedge data from Zabbix using the Zabbix api and puts it into my influxdb measurement vedge whose definition I have shown above. I would say the code is fairly generic, except that it relies on the existence of a master file which contains all the relevant static data about the vedges such as their interface names, Zabbix itemids, and their maximum bandwidth (we called it zabbixSpeed). You could pretty much deduce the format of this master file by reverse-engineering this script. So anyway here is feed_influxdb.py.

from pyzabbix import ZabbixAPI
import requests, json, sys, os, re
import time,datetime
from time import sleep
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
from modules import aux_modules,influx_modules

# we need to get data out of Zabbix
inventory_file = 'prod.config.visibility_dashboard_reporting.json'
#inventory_file = 'inv-w-bw.json' # this is a modified version of the above and includes Zabbix bandwidth for most ports
# Login Zabbix API - use hidden variable to this pipeline
token_zabbix = os.environ['ZABBIX_AUTH_TOKEN']
url_zabbix = 'https://zabbix.drjohns.com/'
zapi = ZabbixAPI(url_zabbix)
zapi.login(api_token=token_zabbix)
# Load inventory file
with open(inventory_file) as inventory_fh:
    inventory_json = json.load(inventory_fh)
inventory_s = json.dumps(inventory_json)
inventory_d = json.loads(inventory_s)
# Time range for which we want to get data (Unix timestamps)
time_till = int(time.mktime(datetime.datetime.now().timetuple()))
time_from = int(time_till - 780)  # about 12 minutes plus an extra minute to reflect start delay, etc
i=0
max_items = 200
item_l = []
itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname = {},{},{},{}
gmtOffset_d = {}
for SSID in inventory_d:
    print('SSID',SSID)
    hostname_d = inventory_d[SSID]['hostname']
    gmtOffset = aux_modules.gmtOffset_calc(inventory_d[SSID])
    gmtOffset_d[SSID] = gmtOffset
    for vedge_s in hostname_d:
        print('vedge_s',vedge_s,flush=True)
        items_l = hostname_d[vedge_s]
        for item_d in items_l:
            print('item_d',item_d,flush=True)
            itemname = item_d['itemname']
            if not 'lineType' in item_d: continue # probably SNMP availability or something of no interest to us
            lineType = item_d['lineType']
            if 'zabbixSpeed' in item_d:
                bandwidth = int(item_d['zabbixSpeed'])
            else:
                bandwidth = 0
            itemid = item_d['itemid']
            if lineType == 'MPLS' or lineType == 'Internet':
                i = i + 1
                itemid_to_vedge[itemid] = vedge_s # we need this b.c. Zabbix only returns itemid
                itemid_to_ltype[itemid] = lineType # This info is nice to see
                itemid_to_bw[itemid] = bandwidth # So we can get percentage used
                itemid_to_itemname[itemid] = itemname # human-readable name, used to build the item tag
                item_l.append(itemid)
                if i > max_items:
                    print('item_l',item_l,flush=True)
                    params = {'itemids':item_l,'time_from':time_from,'time_till':time_till,'history':0,'limit':500000}
                    print('params',params)
                    res_d = zapi.do_request('history.get',params)
                    #print('res_d',res_d)
                    #exit()
                    print('After call to zapi.do_request')
                    result_l = res_d['result']
                    Pts = aux_modules.zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname)
                    for Pt in Pts:
                        print('Pt',Pt,flush=True)
                        # DEBUGGING!!! Normally call to data_entry is outside of this loop!!
                        #influx_modules.data_entry([Pt])
                    influx_modules.data_entry(Pts,gmtOffset_d)
                    item_l = [] # empty out item list
                    i = 0
                    sleep(0.2)
else:
    # we have to deal with leftovers which did not fill the max_items
    if i > 0:
        print('Remainder section')
        print('item_l',item_l,flush=True)
        params = {'itemids':item_l,'time_from':time_from,'time_till':time_till,'history':0,'limit':500000}
        res_d = zapi.do_request('history.get',params)
        print('After call to zapi.do_request')
        result_l = res_d['result']
        Pts = aux_modules.zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname)
        for Pt in Pts:
            # DEBUGGING!!! normally data_entry is called after this loop
            print('Pt',Pt,flush=True)
            #influx_modules.data_entry([Pt])
        influx_modules.data_entry(Pts,gmtOffset_d)
print('All done feeding influxdb!')

I’m not saying it’s great code. I’m only saying that it gets the job done. I made it more generic in April 2023 so far fewer lines of code have hard-coded values, which even I recognized as ugly and limiting. I now feed the dict structure, which is pretty cool. It relies on a couple of auxiliary scripts. Here is aux_modules.py (it may include some packages I need later on).

import re
import time as tm
import numpy as np

def zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname):

# turn Zabbix results into a list of points which can be fed into influxdb
# [{'itemid': '682837', 'clock': '1671036337', 'value': '8.298851463718859E+005', 'ns': '199631779'},

    Pts = []
    for datapt_d in result_l:
        itemid = datapt_d['itemid']
        time = datapt_d['clock']
        value_s = datapt_d['value']
        value = float(value_s) # we are getting a floating point represented as a string. Convert back to float
        hostname = itemid_to_vedge[itemid]
        ltype = itemid_to_ltype[itemid]
        itemname = itemid_to_itemname[itemid]
# item is a hybrid tag, like a primary tag key
        iface_dir = re.sub(r'(\S+) interface (\S+) .+',r'\1_\2',itemname)
        item = hostname + '_' + ltype + '_' + iface_dir
        if itemid in itemid_to_bw:
            bw_s = itemid_to_bw[itemid]
            bw = int(bw_s)
            if bw == 0:
                percent = 0
            else:
                percent = int(100*value/bw)
        else:
            percent = 0
        #tags = [{'tag':'hostname','value':hostname},{'tag':'itemid','value':itemid},{'tag':'ltype','value':ltype},{'tag':'item','value':item}]
        tags = {'hostname':hostname,'itemid':itemid,'ltype':ltype,'item':item}
        fields = {'value':value,'percent':percent}
        Pt = {'measurement':'vedge','tags':tags,'fields':fields,'time':time}
        Pts.append(Pt)
    return Pts
def itembasedd(json_data,Region):
# json_data is the master json file the vedge inventory
    itemBasedD = {}

    offsetdflt = {'AP':8,'NA':-5,'EU':1,'SA':-3}

    for SSID_k in json_data:
        SSID_d = json_data[SSID_k]
        print('SSID_k',SSID_k)
        region = SSID_d['region']
        if not region == Region: continue # just look at region of interest
        siteCategory = SSID_d['siteCategory']
        if 'gmtOffset' in SSID_d:
            time_off = SSID_d['gmtOffset']
        else:
            time_off = offsetdflt[region]
        for vedge_k in SSID_d['hostname']:
            vedge_l = SSID_d['hostname'][vedge_k]
            #print('vedge_d type',vedge_d.__class__)
            #print('vedge_d',vedge_d)
            for this_item_d in vedge_l:
                    print('this_item_d',this_item_d)
                    if not 'lineType' in this_item_d: continue
                    ltype = this_item_d['lineType']
                    if not (ltype == 'MPLS' or ltype == 'Internet'): continue
                    itemname = this_item_d['itemname']
                    if not re.search('gress ',itemname): continue
                    itemid =  this_item_d['itemid']
                    if not 'zabbixSpeed' in this_item_d: continue # some dicts may be historic
                    zabbixSpeed = int(this_item_d['zabbixSpeed']) # zabbixSpeed is stored as a string
                    iface = re.sub(r' interface .+','',itemname)
                    direction = re.sub(r'.+ interface (\S+) traffic',r'\1',itemname)
                    item = vedge_k + '_' + ltype + '_' + iface + '_' + direction
# we may need additional things in this dict
                    itemBasedD[itemid] = {"item":item, "Time_Offset":time_off,"region":region,"speed":zabbixSpeed,'category':siteCategory}
                    print('itemid,itemBasedD',itemid,itemBasedD[itemid])
# let's have a look
#for itemid,items in itemBasedD.items():
#for itemid,items in itemBasedD.items():
#    print("item, dict",itemid,items)

    return itemBasedD

def getitemlist(region,itemBasedD,max_items):
# return list of itemids we will need for this region
    iteml1,iteml2 = [],[]
    for itemid,items in itemBasedD.items():
        if itemid == '0000': continue
        iregion = items['region']
        if iregion == region:
            if len(iteml1) == max_items:
                iteml2.append(itemid)
            else:
                iteml1.append(itemid)

    return iteml1,iteml2

def get_range_data(alldata,itemD):
    data_range = []
#
    for datal in alldata:
        #print("datal",datal)
# check all these keys...
        itemid = datal["itemid"]
        timei = datal["clock"]
        timei = int(timei)
# timei is CET. Subtract 3600 s to arrive at time in UTC.
        timei = timei - 3600
# hour of day, UTC TZ
        H = int(tm.strftime("%H",tm.gmtime(timei)))
# transform H based on gmt offset of this vedge
        H = H + itemD[itemid]["Time_Offset"]
        H = H % 24
# Now check if this hour is in the range 8 AM to 6 PM local time (formerly 7 AM to 7 PM)
        #if H < 7 or H > 18:
        if H < 8 or H > 17: # change to 8 AM to 6 PM range 22/03/08
        #print("H out of range",H)
            continue
        data_range.append(datal)

    return data_range

def massage_data(alldata,item_based_d):
# itemvals - a dict indexed by itemid
    itemvals = {}
    #print("alldata type",alldata.__class__)
    for datal in alldata:
# datal is a dict
        #print("datal type",datal.__class__)
        #print("datal",datal)
        val = datal["value"]
        valf = float(val)
        itemid = datal["itemid"]
        if not itemid in itemvals:
            itemvals[itemid] = []
        itemvals[itemid].append(valf)

    return itemvals

def domath(itemvals,item_based_d):
    for itemid,valarray in itemvals.items():
        #print("itemid,valarray",itemid,valarray)
        avg = np.average(valarray)
        n95 = np.percentile(valarray,95)
        max = np.amax(valarray)
        speed = item_based_d[itemid]["speed"]
        if speed > 0:
            avg_percent = 100*avg/speed
            n95_percent = 100*n95/speed
            max_percent = 100*max/speed
        else:
            avg_percent = 0.0
            n95_percent = 0.0
            max_percent = 0.0

        avgm = round(avg/1000000.,1) # convert to megabits
        n95m = round(n95/1000000.,1)
        maxm = round(max/1000000.,1)
        item_based_d[itemid]["avg"] = avgm
        item_based_d[itemid]["n95"] = n95m
        item_based_d[itemid]["max"] = maxm
        item_based_d[itemid]["avg_percent"] = round(avg_percent,1)
        item_based_d[itemid]["n95_percent"] = round(n95_percent,1)
        item_based_d[itemid]["max_percent"] = round(max_percent,1)
        item_based_d[itemid]["speedm"] = round(speed/1000000.,1)

    #print("item_based_d",item_based_d)

def pri_results(item_based_d):
    print('item_based_d',item_based_d)

def stats_to_pts(item_based_d):

# turn item-based dict results into a list of points which can be fed into influxdb
#{'683415': {'item': 'NAUSNEWTO0057_vEdge1_MPLS_ge0/1.4000_ingress', 'region': 'NA', 'category': 'Hybrid Silver+', 'avg': 4.4, 'n95': 16.3, 'max': 19.5, 'avg_percent': 22.0, 'n95_percent': 81.6, 'max_percent': 97.3,

    Pts = []
    time = int(tm.time()) # kind of a fake time. I don't think it matters
    for itemid,itemid_d in item_based_d.items():
        category = itemid_d['category']
        item = itemid_d['item']
        region = itemid_d['region']
        t_off = itemid_d['Time_Offset']
        speed = float(itemid_d['speed']) # speed needs to be a float
        if 'avg' in itemid_d and 'n95' in itemid_d:
            avg = itemid_d['avg_percent']
            n95 = itemid_d['n95_percent']
            max = itemid_d['max_percent']
        else:
            avg,n95,max = (0.0,0.0,0.0)
        tags = {'item':item,'category':category,'region':region,'GMT_offset':t_off}
        fields = {'avg':avg,'n95':n95,'max':max,'speed':speed}
        Pt = {'measurement':'vedge_stat','tags':tags,'fields':fields,'time':time}
        Pts.append(Pt)
    return Pts
def gmtOffset_calc(SSID_d):
    offsetdflt = {'AP':8,'NA':-5,'EU':1,'SA':-3}
    region = SSID_d['region']
    if 'gmtOffset' in SSID_d and SSID_d['gmtOffset']:
        gmtOffset = SSID_d['gmtOffset']
    else:
        gmtOffset = offsetdflt[region]
    return gmtOffset

Next I’ll show influx_modules.py.

import influxdb_client, os, time
from urllib3 import Retry
from datetime import datetime, timezone
import pytz
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
import random,re

def data_entry(Pts,gmtOffset_d):
# Set up variables
    bucket = "poc_bucket2" # DrJ test bucket
    org = "poc_org"
    influxdb_cloud_token = os.environ['INFLUX_AUTH_TOKEN']
# PROD setup
    bucket_prod = "UC02" # we are use case 2
    #bucket_prod = "test" # we are use case 2
    org_prod = "DrJohns - Network Visibility"
    influxdb_cloud_token_prod = os.environ['INFLUX_AUTH_TOKEN_PROD']

# Store the URL of your InfluxDB instance
    url_local ="http://10.199.123.233:8086/"
    url_prod ="https://westeurope-1.azure.cloud2.influxdata.com/"
# we get occasional read timeouts. Let's see if this helps. -DrJ 2023/09/15 https://github.com/influxdata/influxdb-client-python#handling-errors
    retries = Retry(connect=10, read=10, redirect=5)
# Initialize client
    client = influxdb_client.InfluxDBClient(url=url_local,token=influxdb_cloud_token,org=org)
    client_prod = influxdb_client.InfluxDBClient(url=url_prod,token=influxdb_cloud_token_prod,org=org_prod,timeout=30000,retries=retries)

# Write data
    write_api = client.write_api(write_options=SYNCHRONOUS)
    write_api_prod = client_prod.write_api(write_options=SYNCHRONOUS)

    pts = []
    SSID_seen_flag = {}
    for Pt in Pts:
        item = Pt['tags']['item']
        time = int(Pt['time'])

# look up the gmtOffset. SSID is the key to the gmt dict
        SSID = re.sub(r'_.+','',item) # NAUSNEWTOO0001_vEdge1_MPLS_ge0/1.4084_ingres
        gmtOffset = gmtOffset_d[SSID] # units are hours, and can include fractions
        gmtOffset_s = int(3600 * gmtOffset)
        time_local = time + gmtOffset_s
# convert seconds since epoch into format required by influxdb. pt_time stays utc, not local!
        pt_time = datetime.fromtimestamp(time, timezone.utc).isoformat('T', 'milliseconds')
# pull out the hour of day; since time_local is already shifted by gmtOffset, this is the vedge's local hour
        ts = datetime.fromtimestamp(time_local).astimezone(pytz.UTC)
        Hlocal = ts.strftime('%H')
        if len(Hlocal) == 1: Hlocal = '0' + Hlocal # pad single digits with a leading 0 so sort behaves as expected
# extend dict with tag for UTChour
        Pt['tags']['UTChour'] = Hlocal
# overwrite time here
        Pt['time'] = pt_time
        if not SSID in SSID_seen_flag:
            #print('item,Hlocal,gmtOffset,gmtOffset_s,time,time_local',item,Hlocal,gmtOffset,gmtOffset_s,time,time_local) # first iteration print
            print('item,Pt',item,Pt)
            SSID_seen_flag[SSID] = True
        ##point = Point(measurement).tag("hostname",hostname).tag("itemid",itemid).tag("ltype",ltype).tag("item",item).tag("UTChour",Hlocal).field('value',value).field('percent',percent).time(pt_time)
# based on https://github.com/influxdata/influxdb-client-python/blob/master/influxdb_client/client/write/point.py
        point = Point.from_dict(Pt)

        pts.append(point)
# write to POC and PROD buckets for now
    print('Writing pts to old and new Influx locations')
    write_api.write(bucket=bucket, org="poc_org", record=pts, write_precision=WritePrecision.S)
    write_api_prod.write(bucket=bucket_prod, org=org_prod, record=pts, write_precision=WritePrecision.S)

def data_entry_stats(Pts):
# Set up variables
    bucket = "poc_bucket2" # DrJ test bucket
    org = "poc_org"
    influxdb_cloud_token = os.environ['INFLUX_AUTH_TOKEN']

# Store the URL of your InfluxDB instance
    url_local ="http://10.199.123.233:8086/"
    url_prod ="https://westeurope-1.azure.cloud2.influxdata.com/"

# PROD setup
    bucket_prod = "UC02" # we are use case 2
    org_prod = "DrJohns - Network Visibility"
    influxdb_cloud_token_prod = os.environ['INFLUX_AUTH_TOKEN_PROD']

# Initialize client
    client = influxdb_client.InfluxDBClient(url=url_local,token=influxdb_cloud_token,org=org)
    client_prod = influxdb_client.InfluxDBClient(url=url_prod,token=influxdb_cloud_token_prod,org=org_prod)

# Write data
    write_api = client.write_api(write_options=SYNCHRONOUS)
    write_api_prod = client_prod.write_api(write_options=SYNCHRONOUS)

    pts = []
    for Pt in Pts:
# debug
#        print('avg type',avg.__class__,'item',item,flush=True)

        time = Pt['time']
# convert seconds since epoch into format required by influxdb
        pt_time = datetime.fromtimestamp(int(time), timezone.utc).isoformat('T', 'milliseconds')
# overwrite time here
        Pt['time'] = pt_time
        ##point = Point(measurement).tag("item",item).tag("category",category).tag("region",region).tag("GMT_offset",t_off).field('n95',n95).field('avg',avg).field('max',max).field('speed',speed).time(pt_time)
# see aux_modules stats_to_Pts for our dictionary structure for Pt
        point = Point.from_dict(Pt)
        pts.append(point)
    print('Write to old and new influxdb instances')
    write_api.write(bucket=bucket, org="poc_org", record=pts, write_precision=WritePrecision.S)
    write_api_prod.write(bucket=bucket_prod, org=org_prod, record=pts, write_precision=WritePrecision.S)

These scripts show how I accumulate a bunch of points and make an entry in influxdb once I have a bunch of them to make things go faster. These days I am updating two influxdb instances: a production one that actually uses InfluxDB Cloud (hence the URL is a generic endpoint which may actually work for you), and a POC one which I run on my private network.
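The batching idea on its own, stripped of the Zabbix and InfluxDB calls, can be sketched like this. It is not a copy of the script's logic (the script flushes once its counter exceeds max_items and then handles the remainder separately); batches is a hypothetical helper and the itemids are fake.

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

itemids = [str(n) for n in range(450)]  # fake itemids
sizes = [len(batch) for batch in batches(itemids, 200)]
print(sizes)
```

Each batch would then become one history.get request and one write to InfluxDB, with the last, shorter batch playing the role of the remainder section.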

What it looks like

This is the view of multiple vedges which match the selection criteria of high bandwidth usage in region Europe:

Then I figured out how to provide a link to a detailed traffic graph for these selection criteria. Obviously, that mostly involved switching the visualization to Time Series. But I wanted as well to provide the interface bandwidth on the same graph. That was tricky and involved creating a transform that is a config query which takes speed from the table and turns it into Threshold1, which I draw as a red dashed line. It’s sort of too much detail to go into it further in this article. I wanted to make a second config query but it turns out this is not supported – still.

As for the link, I have a text panel where I use raw html. My HTML, which creates the active link you see displayed is:

<br>
<H2>
<a target="details-multiple-ifaces" href=
"/d/8aXikCa4k/multiple-vedges-graph?orgId=1&${Region:queryparam}&${threshold:queryparam}&${math:queryparam}&${math_days:queryparam}">
Detailed traffic graph for matching interfaces</a>
</H2>

So here is what the detailed traffic graph looks like:

I love that red dashed line showing the interface bandwidth capacity!

I almost forgot to mention it, there is a second query, B, which I use as a basis for the dynamic threshold to pick up the “speed” of the interface. Here it is:

data = from(bucket: "UC02")
  |> range(start: -1d, stop: now())
  |> filter(fn: (r) =>
    r._measurement == "vedge_stat" and
    r._field == "speed" and r.item == "${item}"
  )
  |> drop(columns: ["item","category","region","GMT_offset"])
data

Back to single vedge

At the top of this post I showed the heat map for a single vedge. It includes an active link which leads to a detailed traffic graph. That link in turn is in a Text Panel with HTML text. This is the HTML.

<br>
<H2>
<a target="details-single-iface" href=
"/d/ozKDXiB4k/single-vedge-graph?orgId=1&${Region:queryparam}&${threshold:queryparam}&${math:queryparam}&${math_days:queryparam}&${item:queryparam}">
Detailed traffic graph for ${item}</a>
</H2>

The single vedge detailed graph is a little different from the multiple vedge detailed graph – but not by much. I am getting long-winded so I will omit all the details. Mainly I’ve just blown up the vertical scale and omit panel iteration. So here is what you get:

In full disclosure

In all honesty I added another field called speed to the vedge_stats InfluxDB measurement. It’s kind of redundant, but it made things a lot simpler for me. It is that field I use in the config query to set the threshold which I draw with a red dashed line.

Not sure I mentioned it, but at some point I re-interpreted the meaning of UTChour to be local time zone hour! This also was a convenience for me since there was a desire to display the heat maps in the local timezone. Instead of messing around with shifting hours in the Flux query language – which would have taken me days or weeks to figure out – I just did it in my Python code, which I think I shared above. So much easier…
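For illustration, a minimal sketch of that hour shift in Python. This is not the actual feeder code; the function name is hypothetical, and it assumes the GMT offset is a whole number of hours and that the tag is stored as a zero-padded string such as "08".

```python
from datetime import datetime, timedelta, timezone

def local_hour_tag(utc_time, gmt_offset_hours):
    """Return the local hour as a zero-padded string for the UTChour tag."""
    local = utc_time + timedelta(hours=gmt_offset_hours)
    return f"{local.hour:02d}"

sample = datetime(2023, 6, 20, 13, 30, tzinfo=timezone.utc)
print(local_hour_tag(sample, -4))   # 09
print(local_hour_tag(sample, 11))   # 00 (already the next day locally)
```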

Now for the harder stuff, v2.1

Loop over a date range

I really, really wanted my panels to repeat, one panel per day, based on a template variable which represented the days between the two points on the time picker. I.e., I now want my dashboards to incorporate the time picker. Grafana, even v 10, makes this extremely difficult to do. But I did succeed in the end. Painfully. So how did I do it?

I defined a hidden variable I call days3. days1 and 2 were failed attempts! Here is days3, which is a query variable type:

import "regexp"
import "date"
//item = "NAUSDRJOHN0329_vEdge1_MPLS_ge0/1.40_egress" // for testing
startTrunc = date.truncate(t: v.timeRangeStart, unit: 1d)
stopTruncTmp = date.truncate(t: v.timeRangeStop, unit: 1d)
stopTrunc = date.add(d: 1d, to: stopTruncTmp)
from(bucket: "DrJ02")
|> range (start: startTrunc, stop:stopTrunc)
|> filter(fn:(r) => r._measurement == "vedge_stat" and r.item == "${item}")
|> keep(columns:["_time","_value"])
|> aggregateWindow(every: 1d, timeSrc: "_start", fn: last)
|> map(fn: (r) => ({r with timeString: string(v: r._time)}))
|> keep(columns:["timeString"])
|> map(fn: (r) => ({r with timeString: regexp.replaceAllString(r: /T00.*/, v: r.timeString, t:"")}))

I guess it’d take forever to explain. It shows why I call it v 2.1: I learned more stuff since I started out. This produces days3 as a string. Its values are 2023-06-20, 2023-06-19, etc. I found I had to choose multi-value and permit All values for it to work with my panel. Not sure why… Then my panel iterates against this variable. Actually not one but two side-by-side panels! The left side produces the day and date, and each of its rows sits alongside the right panel which contains the data for that day. Here is the left hand stat visualization Flux query:
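If the Flux above is hard to follow, the same idea can be sketched in a few lines of Python: truncate both ends of the time picker to day boundaries and list every day in between as a YYYY-MM-DD string. The function name is hypothetical and the sketch assumes naive UTC datetimes. Unlike the Flux query, it does not check that data actually exists for each day.

```python
from datetime import datetime, timedelta

def days_in_range(start, stop):
    """List each calendar day touched by the range as a YYYY-MM-DD string."""
    first = start.date()          # like date.truncate(unit: 1d) on the start
    last = stop.date()            # the stop day itself is included
    count = (last - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(count + 1)]

print(days_in_range(datetime(2023, 6, 19, 14, 0), datetime(2023, 6, 20, 14, 0)))
```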

import "array"
import "date"
import "dict"
days3Time = time(v: "${days3}")
month = string(v: date.month(t: days3Time))
day = string(v: date.monthDay(t: days3Time))
dayWint = date.weekDay(t: days3Time)
weekdayOnlyDict = [0:"   ", 1:"Mon", 2:"Tue", 3:"Wed", 4:"Thu", 5:"Fri", 6:"   "]
DAY = dict.get(dict:weekdayOnlyDict, key:dayWint, default:"")
niceDate = DAY + " " + day + "." + month
arr = [{valueString: niceDate}]
array.from(rows: arr)

I don’t use comments because it’s a tiny window you enter code into, so let me explain here. I had to convert days3, which was a string, into a time with the time() function. array.from is your friend and can turn a simple array with one element into a table which will be accepted by stat.

Why stat and why not a Text panel?

The fonts deployed in stat are much better than what you can probably whip up yourself in a Text panel. So no sense wasting time figuring out CSS and all. Just give in and use Stat.

So that left hand side produces a row with Tue 20.6 for today. It has a very narrow width and it repeats vertically.
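The same label logic, sketched in Python for anyone who wants to test it outside Grafana. The function name is hypothetical. The one thing to watch is the weekday numbering: the dictionary expects Sunday as 0 and Saturday as 6, which is how I understand Flux's date.weekDay to count, so the sketch converts Python's numbering to match.

```python
from datetime import date

# weekends are rendered as blanks, exactly as in the Flux dictionary
WEEKDAY_ONLY = {0: "   ", 1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu", 5: "Fri", 6: "   "}

def nice_date(day_string):
    """Turn a YYYY-MM-DD string into a short label such as 'Tue 20.6'."""
    d = date.fromisoformat(day_string)
    day_of_week = d.isoweekday() % 7   # Sunday=0 ... Saturday=6
    return WEEKDAY_ONLY[day_of_week] + " " + str(d.day) + "." + str(d.month)

print(nice_date("2023-06-20"))   # Tue 20.6
```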

Query for template vs query for the panel

They’re kind of sort of both using Flux. But probably the panel needs to incorporate more logic.

Invalid character ‘\x1f’ looking for beginning of value

You know that days3 variable I mentioned above? I swear this is true. I referred to it in my first panel query using the naked reference ${days3} and all was good. As a bonus it was even interpreted as a time variable. Then for my second panel I literally copied the query from the first one, and yet I got this error: Invalid character ‘\x1f’ looking for beginning of value. As all my experiments cost me precious time, I don’t have the exact reason for this error. It can be one of these things:

  • reference to variable where All and multi-value was selected
  • naked reference to template variable: ${template_variable} instead of "${template_variable}"

Most likely it’s related to the second bullet item. So in my second panel I had to introduce the double-quotes, which had the side-effect of I guess forcing its interpretation as a string, not a time, which required me to convert it. And then – I swear this is the case – after fixing my second panel up, my first panel – which mind you had been fine all this time – began to complain as well! So then I had to go into my previously working first panel query and put double-quotes around its reference to days3 and convert it to a time. I became scared about variable scope and assigned days3 to a different variable to be safe. Again, no time for proper experimentation.

That’s either a bug or a screwy fact about variable scope.

compilation failed. error Unexpected token for property key DOT (.)

This error usually occurs when you reference a non-existent column name in a map function, perhaps due to a typo in the column name.

Template Variable reference (interpolation)

Sometimes a naked reference to a template variable such as ${template_variable} suffices, and sometimes you gotta have double-quotes around that reference. And you may even need the double quotes in a chained variable whereas your panel query expression is fine without them. I have to read up more on what’s going on there to come up with a consistent recommendation. For now it’s trial and error.

The easiest way to use stat visualization for a text header

import "array"
arr = [{valueString: "SSID vedge ltype iface direction"}]
array.from(rows: arr)

Yup. That works. And you get a very nice stat box. Just choose showing Fields: valueString, Text mode: value, Text size (value): 14, Color mode: background solid, and maybe a nice base color like kind of dark blue. This is good for a header column which is a panel which is one unit high.

array.from is your friend

And it magically makes tables out of any kind of junk, as in the example directly above. By junk I mean data where you don’t care about the time.

Tip: Grafana’s error recognition is good

It tries to tell you exactly which column within which line of your query it doesn’t like.

But, also, frequently switch between the stat view and Table to really see what’s going on. Because you only get a small window of real estate, use your mouse scroll button after appropriately positioning your mouse to view all values.

Tip: frequently shift between regular view and stat view

You are dead without that table view option as you develop your queries. And use a large monitor! And learn to minimize the Stat options by hitting the rightward-pointing arrow. You will make lots of mistakes and the Table view is the only way to see what’s going on.

Avoid Row of panels

I was initially using row of panels. It wastes lots of precious screen real estate. A minimum of 3 vertical units are required, I believe. And it was slow. Instead, I found I can use side-by-side vertically repeating panels each one unit high.

Panel geometry

I keep referring to units. It’s a term I made up. But panels seem to observe snap geometry and can only come in certain units of sizes. The smallest unit is reasonable, I suppose. Working with small panels can be a bear however – trying to find where to position the mouse to get to the Edit menu can be challenging.

WAN Report example, July 2023

Meanwhile I’ve learned some more tricks and I will share everything… Some things are obvious, some not so much. The end result is 60 awesome lines of Flux language coding!

Setting the scene

I produce a WAN report for the network group to peruse. Each row is an interesting interface on a vedge, either the Internet or MPLS interface. The combo site ID_vedgeNumber_iface-type_iface-name_direction is the left column. On the right are columns which provide the average, n95 and max values of that interface for, by default, the last 24 hours, but only during business hours.

It looks like this:

WAN Capacity Report, blurred

The template variables at the top are:

  • Region
  • threshold
  • math
  • math_days
  • GMT_offset
  • country
  • SSID
  • item

A word about the panel layout

I realized you don’t need and should not use repeating rows of panels. Just iterate over something. I iterate over item which I think of as a dependent variable. Everything before item is a filter criterion which determines the included items. Then I simply size a panel on the left to be one unit high and just long enough to accommodate my longest item name. To the right of that is a second panel sized also to be one unit high and several units long. Both are set up with the stat visualization and repeat for template variable item. I don’t know why I didn’t think of this approach earlier!

Start simple: the left column containing the item names

Easy peasey:

import "array"
arr = [{valueString: "${item}"}]
array.from(rows: arr)

Note I am using the array.from trick to create a table. I use the stat visualization even though Text is the more obvious choice because stat has excellent fonts!

In stat I repeat by variable item as already mentioned. The field valueString is the one to be included in the panel. Color mode is background solid, Text mode is Value. I have a Data link to my single vedge heatmap. This link is: /d/fdasd-1a18-f9037d3/single-vedge-heat-map-v2-1?orgId=1&&var-days3=All&${Region:queryparam}&${threshold:queryparam}&${math:queryparam}&${math_days:queryparam}&${item:queryparam}

Panel with avg/n95/max

By contrast, this panel forced me to learn new techniques and is somewhat complex.

import "math" // v 2.13 -DrJ 2023.07.05
import "array"
import "regexp"
import "strings"
import "date"
import "dict"
import "join"
CC = strings.substring(v: "${item}", start:2, end:4) // returns, e.g., US
regionOffsetDict = ["AP":8h,"NA":-4h,"EU":0h,"SA":-3h]
offset_dur = dict.get(dict:regionOffsetDict, key:"${Region}", default:0h)
startRegion = date.add(d: offset_dur, to: v.timeRangeStart)
startTrunc = date.truncate(t: startRegion, unit: 1d)
stopRegion = date.add(d: offset_dur, to: v.timeRangeStop)
stopTrunc = date.truncate(t: stopRegion, unit: 1d)
startDayInt = date.yearDay(t: startTrunc)
stopDayInt = date.yearDay(t: stopTrunc)
timePickerDays = stopDayInt - startDayInt
data = from(bucket: "${bucket}")
  |> range(start:startTrunc, stop: stopTrunc)
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and r.hostname =~ /^${Region}/ and r.item == "${item}" and 
    (r.UTChour == "08" or r.UTChour == "09" or r.UTChour == "10" or r.UTChour == "11" 
    or r.UTChour == "12" or r.UTChour == "13" or r.UTChour == "14" or r.UTChour == "15"
    or r.UTChour == "16" or r.UTChour == "17" or r.UTChour == "18")
  )
  |> map(fn: (r) => ({r with dayNumber: date.weekDay(t: r._time) })) // get day of week
  |> map(fn: (r) => ({r with day: date.yearDay(t: r._time) })) // get day of year
  |> map(fn: (r) => ({r with day: string(v: r.day)})) // convert day to string cf. day tag in  holidays measurement
  |> map(fn: (r) => ({r with workDay: if r.dayNumber == 0 then false else if r.dayNumber == 6 then false else true  }))
  |> filter(fn: (r) => r.workDay == true or timePickerDays == 1) // just consider work days, i.e., Mon - Fri unless today is Monday
 // |> map(fn: (r) => ({r with day: "185"})) //JH TEMP for debugging the join
  |> keep(columns:["day","_value"])
holidays = from(bucket: "${bucket}") // extract all the holidays for this country
  |> range(start:-58d)
  |> filter(fn: (r) =>
    r._measurement == "holidays" and r.CC == CC
    )
  |> last() // to only spit out the most recent run
  |> group(columns:["year","CC"])
  |> keep(columns:["day","_value"])
myjoin = join.left(   // join iface data with holiday data
    left: data,
    right: holidays,
    on: (l, r) => l.day == r.day,
    as: (l, r) => ({_value: l._value, holiday_flag:r._value})
  )
dataNoHolidays = myjoin // only take data where there was no holiday OR time period == 1 day
 |> filter(fn: (r) => not exists r.holiday_flag or timePickerDays == 1)
 |> keep(columns:["_value"])
meanTbl = dataNoHolidays |> mean()
maxTbl = dataNoHolidays
  |> max()
  |> toFloat()
n95Tbl = dataNoHolidays |> quantile(q: 0.95)
3values = union(tables: [meanTbl,n95Tbl,maxTbl])
 |> map(fn: (r) => ({r with _value: math.trunc(x: r._value)}))
 |> map(fn: (r) => ({r with valueString: string(v: r._value)+"%"}))
 |> keep(columns:["_value","valueString"])
3values

There is a lot going on here but when you break it down it’s not so bad. Let me highlight the last piece I struggled with last week.

To begin with I wanted to know if we could exclude weekends in the calculation. That is possible. The idea is that we create the columns dayNumber and workDay in that order. dayNumber is the day of the week and workDay is a boolean – either true or false. I found you could do this in incremental steps.

map(fn: (r) => ({r with dayNumber: date.weekDay(t: r._time) })) // get day of week

sets up the dayNumber column and then

map(fn: (r) => ({r with workDay: if r.dayNumber == 0 then false else if r.dayNumber == 6 then false else true  }))

creates the boolean workDay. Then you run the filter:

filter(fn: (r) => r.workDay == true or timePickerDays == 1) // just consider work days, i.e., Mon - Fri unless today is Monday

So now you have tossed out weekend days. Cool.
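For anyone who thinks better in Python than in Flux, here is a rough equivalent of the business-hours and weekend filtering. It is a sketch only: the row layout (a dict with time and UTChour keys) and the function name are assumptions for illustration, and the weekday numbering is converted so that Sunday is 0 and Saturday is 6, as the Flux code expects.

```python
from datetime import datetime

# the same eleven hour tags the Flux filter spells out: "08" through "18"
BUSINESS_HOURS = {f"{h:02d}" for h in range(8, 19)}

def keep_row(row, time_picker_days):
    """Keep business-hours rows on work days; keep weekends only for a one-day range."""
    if row["UTChour"] not in BUSINESS_HOURS:
        return False
    day_number = row["time"].isoweekday() % 7      # Sunday=0 ... Saturday=6
    work_day = day_number not in (0, 6)
    return work_day or time_picker_days == 1

saturday = {"time": datetime(2023, 6, 24, 10, 0), "UTChour": "10"}
tuesday = {"time": datetime(2023, 6, 20, 10, 0), "UTChour": "10"}
print(keep_row(saturday, 7), keep_row(saturday, 1), keep_row(tuesday, 7))
```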

Extend this to exclude national holidays

Extending this to also exclude national holidays is much harder however. At least I did not find a nice way to do it, so I did it the hard way. Of course first I had to determine the national holidays. I didn’t want to use an api because I feel that would slow things down and speed is of the essence.

The main idea is to use the Python holidays package. I suppose I could have created a CSV file with the results but I didn’t. I stuffed the results into another measurement which I define as:

The holidays measurement

The main Python code which fills this measurement is:

from datetime import date
from datetime import datetime
import holidays
from modules import aux_modules, influx_modules

year = datetime.now().year

# CCs was scraped from the country variable in Grafana from WAN Report
file = 'CCs'
with open(file) as f:
    CCs = f.read().splitlines()
print('CCs',CCs)

my_dict = {}

for CC in CCs:
    print(CC)
    try:
        CC_holidays = holidays.country_holidays(CC,years=[year])
    except Exception:
        # country not known to the holidays package: skip it rather than
        # re-using the holidays of the previous country in the loop
        print(CC,'not available in the holidays package')
        continue

    for mydate,name in sorted(CC_holidays.items()):
        day_of_year = mydate.timetuple().tm_yday
        print(CC,mydate,day_of_year)
        if not CC in my_dict: my_dict[CC] = []
        my_dict[CC].append(day_of_year)
# prepare our Pts data structure
Pts = aux_modules.holidays_to_pts(year,my_dict)
for Pt in Pts:
    print('Pt',Pt,flush=True)
influx_modules.data_entry_stats(Pts)
print('All done!')

The file CCs just has countries of interest, one per line:

US
CA
MX
ES
etc.

That holidays_to_pts function is very basic:

import time as tm  # assumed: the original module imports time under this alias

def holidays_to_pts(year,my_dict):
    # turn dict into a list of points which can be fed into influxdb
    # {'US':[1,67,357]}

    Pts = []
    time = int(tm.time()) # kind of a fake time. I don't think it matters
    for CC,holiday_list in my_dict.items():
        for day in holiday_list:
            tags = {'CC':CC,'year':year,'day':day}
            fields = {'holiday_flag':True}
            Pt = {'measurement':'holidays','tags':tags,'fields':fields,'time':time}
            Pts.append(Pt)
    return Pts

So we’ve got our holidays measurement. I suppose it could have been written to CSV file but I got used to working with time series so I stuffed the holiday information into InfluxDB as though it were another time series.

I do not see an easy way to refer to a different measurement without introducing a new expression and then joining the data all together with a left outer join. Ideally I would have stuffed the holiday information into a dictionary and then just done a lookup against the dictionary. And I eventually thought of a way to fill a dictionary: use out-of-band code generation! But that would be ugly so I vetoed that idea. Back to the left outer join approach: I eventually got this to work after much trial and error.

Thus I build up the data table with all my filters and such and reduce it at the end to the bare necessities for the join: keep only the columns day and _value. And I create a holidays table which at the end also contains only those two columns. So now I have a common group key and can do the join based on the identical day values. As my holidays is compact and only contains entries for the actual holiday days, the join produces a null value for a day in which there is no match. So the left outer join produces a table, myjoin, with the columns from the data table, plus a column holiday_flag from holidays with the value either true or null.

Left outer join

To get it right definitely required working up from simpler examples using Explorer.

import "array"
import "join"

left =
    array.from(
        rows: [
            {_time: 2022-01-01T00:00:00Z, _value: 1, day: 1, label: "a", percent: 1},
            {_time: 2022-01-01T00:00:00Z, _value: 2, day: 1, label: "b", percent: 2},
            {_time: 2022-02-01T00:00:00Z, _value: 22, day: 1, label: "b", percent: 3},
            {_time: 2022-01-01T00:00:00Z, _value: 3, day: 1, label: "d", percent: 4},
            {_time: 2023-01-01T00:00:00Z, _value: 4, day: 1, label: "a", percent: 5},
            {_time: 2023-01-01T00:00:00Z, _value: 4, day: 11, label: "a", percent: 11},
        ],
    )
right =
    array.from(
        rows: [
            {_time: 2022-01-01T00:00:00Z, _value: true, day: 11, label: "a"},
            {_time: 2022-01-01T00:00:01Z, _value: true, day: 11, label: "a"},
            {_time: 2022-01-01T00:00:00Z, _value: false, day: 11, label: "b"},
            {_time: 2022-01-01T00:00:00Z, _value: true, day: 11, label: "d"},
        ],
    )
join.tables(
    method: "left",
    left: left,
    right: right,
    on: (l, r) => l.label == r.label and l.day == r.day,
    as: (l, r) => ({_time: l._time, label: l.label, v_left: l._value, v_right: r._value, percent: l.percent}),
)

So this helps to get practice: getting the group key to be identical in the two tables, seeing what happens when the left table has entries unmatched in the right table, the right table has duplicated keys, etc.
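To pin down the semantics being practiced here, this is a tiny Python sketch of a left outer join on the day column, the way a SQL-style left join behaves. I am assuming Flux's join package follows the same rules for unmatched and duplicated keys; the function name and sample rows are made up.

```python
def left_join(left, right, key):
    """Left outer join: every left row survives; unmatched rows get a None flag."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if not matches:
            out.append({"_value": l["_value"], "holiday_flag": None})
        for r in matches:
            # one output row per match, so duplicate right keys multiply rows
            out.append({"_value": l["_value"], "holiday_flag": r["_value"]})
    return out

data = [{"day": "185", "_value": 40.0}, {"day": "186", "_value": 55.0}]
holidays = [{"day": "185", "_value": True}]
print(left_join(data, holidays, "day"))
```

Day 185 comes back flagged as a holiday and day 186 comes back with a null (None) flag, which is what the later filter relies on.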

If I ever figure out a way to avoid the left outer join – and I suspect there is a way – I will use it and update my code. Part of my problem is that Flux sucks as far as iterations go. The time window may span more than one day. Ideally I would loop over days (then holidays could be stuffed into a dictionary I suppose), but bear in mind I’m already repeating by item. Loop programming is not supported in Flux, which sucks.

So, anyway, myjoin can now be filtered to only accept non-holiday days:

filter(fn: (r) => not exists r.holiday_flag or timePickerDays == 1)

That produces the dataNoHolidays table which is then used for aggregate functions such as mean() , max() and quantile(). The mean, n95 and max are tables because basically everything is tables and you can’t avoid it. So then a relatively simple union of tables is used to join these three tables to get our three numbers output.
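As a cross-check of what the three numbers mean, here is a small Python sketch that computes them from a plain list of percentages. Two assumptions to flag: the function names are hypothetical, and the 95th percentile uses the simple nearest-rank method, which is a swapped-in technique and will not necessarily match Flux's quantile() estimate to the last digit.

```python
import math

def summarize(values):
    """Return [mean, n95, max], each truncated to a whole number."""
    if not values:
        return []
    ordered = sorted(values)
    mean = sum(ordered) / len(ordered)
    rank = (95 * len(ordered) + 99) // 100      # ceil(0.95 * n) in integer math
    n95 = ordered[rank - 1]
    return [math.trunc(v) for v in (mean, n95, ordered[-1])]

def as_strings(numbers):
    """Format the numbers the way the panel shows them, e.g. '50%'."""
    return [str(n) + "%" for n in numbers]

sample = [float(v) for v in range(1, 101)]
print(as_strings(summarize(sample)))   # ['50%', '95%', '100%']
```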

Monday effect

I also wanted to make the time picker active since that is the “native” way to select time periods. I made the default time from now – 24 hours to now. But I then truncate on day boundaries to keep results as naively expected. But how to avoid the Monday effect? You’ve excluded weekends so now you have no data if it is a Monday!

Hence I also calculate the number of days in the time picker range (timePickerDays). If it’s only one day (the default), then do include weekends and holidays. If it’s longer than one day then someone has interacted with the time picker and has intention to get something of meaning back out such as the n95 value when considering just workdays (no weekends or holidays) from, say, the past week.
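The timePickerDays arithmetic is easy to verify in Python. This sketch mirrors the Flux: shift both ends of the time picker by the regional offset, take the day of the year, and subtract. The function name is hypothetical; the offsets are copied from regionOffsetDict in the query.

```python
from datetime import datetime, timedelta

REGION_OFFSET_HOURS = {"AP": 8, "NA": -4, "EU": 0, "SA": -3}

def time_picker_days(start, stop, region):
    """Number of day boundaries between start and stop, in regional time."""
    offset = timedelta(hours=REGION_OFFSET_HOURS.get(region, 0))
    start_day = (start + offset).timetuple().tm_yday
    stop_day = (stop + offset).timetuple().tm_yday
    return stop_day - start_day

now = datetime(2023, 7, 10, 15, 0)
print(time_picker_days(now - timedelta(hours=24), now, "NA"))   # 1
print(time_picker_days(now - timedelta(days=7), now, "NA"))     # 7
```

One thing the sketch makes visible: because it subtracts day-of-year numbers, a range that straddles New Year comes out negative (1 minus 365), so the result is only meaningful within a single calendar year.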

Text panel

There is a text panel which explains things. I may as well include all that verbatim.

#### WAN Capacity Report
Reporting on interfaces of vedges which match the selection criteria.  
##### v 2.13 Release Notes
Weekends and national holidays are now excluded in the reported n95/avg/max values. *  
Thus you can make the time period the last seven days and get results
which only considers business hours Mon - Fri and excludes national Holidays.
The default view now shows the previous day.  
Time picker is available, but note that date truncation on day boundaries is in effect.  

Link to graph is region-aware and proposes region-appropriate last data.  
The link to the single vedge heatmap is back!
Click on the name of the vedge to get the historical heatmap for this interface.
Improved formatting.  

*) This behavior is overwritten when the default time picker of last 24 hours is in effect
to avoid the situation where no
data will be shown such as on a Monday or the day after a holiday. But if you choose a custom timespan longer than one day,
weekends and national holidays specific to the country of that SSID
will be excluded from the mean/n95/max calculations.

To find highly utilized interfaces, narrow the search by choosing a region, a threshold and optionally a site category or even a GMT offset. All the vedge interfaces which match the criteria will be displayed in the drop down list.  
The statistics are gathered from the previous day (if math_days is set to 1). Three different algorithms are calculated.
n95 is the 95% percentile usage point (over business hours), avg is the average (over business hours) and max is the peak usage point.  
The GMT_offset is a semi-dependent variable: its range of values is restricted by the region chosen.
It is possible to produce 0 matching results by over constraint.  
To test matching on the other variables, lower the threshold to 0.  
All numbers are percentages.
The average (avg), maximum (max) and 95% are calculated for just the business hours of that interface, i.e., from 8 AM to 6 PM.
Click on a displayed item to bring up the historic heat map data from the last few days for it.  
Dark red - 75% and higher  
Light red - 70 - 74 %  
Orange - 60 - 69 %  
Green - 40 - 59 %  
Blue - 0 - 39 %  

Template variables

There is a hidden variable called bucket so I can easily move between DEV and PROD environments. item is the most dependent variable. Here it is:

from(bucket: "poc_bucket2")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stat")
|> filter(fn:(r) => r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.country, set: ${country:json}))
|> filter(fn:(r) => contains(value: r.SSID, set: ${SSID:json}))
|> filter(fn:(r) => contains(value: r.GMT_offset, set: ${GMT_offset:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold})
|> group()
|> distinct(column: "item")
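The ${country:json} trick deserves a word. As I understand Grafana's json format, a multi-value variable is rendered as a JSON array of strings, and that happens to be a valid Flux array literal, which is why it can be dropped straight into contains(). A minimal Python sketch of the substitution, with hypothetical helper names and sample countries:

```python
import json

def flux_set_literal(values):
    """Render selected values as a compact JSON array, e.g. ["US","CA"]."""
    return json.dumps(values, separators=(",", ":"))

def country_filter(selected):
    """Show the filter line as it would look after variable substitution."""
    return ("|> filter(fn:(r) => contains(value: r.country, set: "
            + flux_set_literal(selected) + "))")

print(country_filter(["US", "CA"]))
```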

The stat panel

The stat panel for the three values shows All Values, has a maximum number of rows to display of 3, repeat direction is Vertical, shows the field _value, has layout orientation Vertical, Text mode Name, Color mode background gradient and Text Size 14. Color scheme is from Thresholds by value.

What’s going on with the times?

Just to mention it, the interfaces all have their data recorded in their local time! In fact we pretend to InfluxDB that that time is UTC time, though it isn’t. This permitted us to do many cool things, not so much for this WAN report as for the multi vedge heatmap dashboard. We always display the heatmap in the local time of the vedge. But then some accommodation was needed to handle when you’re looking at Asia and sitting in North America – they could even be onto the next day! Whatever day it is, they are way ahead of you in time, so you have to exceed now() to show their latest data – that sort of thing. But the accommodations were worth it. We just ask that users of the dashboards also set their time to UTC or else nothing makes sense.

WAN Report v 2.15, July 2023

I received suggestions on improving the WAN report. After those refinements it looks like this:

I’ve added the Allocated bandwidth column

So the additional column Allocated bandwidth was requested. That was pretty straightforward. In the detailed graph there was the request to combine ingress and egress interfaces. And always show the allocated bandwidth (it was only occasionally showing). Another request was to permit more than one Region to be selected. This sounds kind of easy, but it required some pretty significant re-work. Fortunately I have enough experience now that it was just adding on a good existing foundation.

And the result is just very visually appealing, I must say! Here is the graph, suitably blurred:

This is what you get when you click on the Detailed traffic graph from the previous dashboard

Awesome, right!? Now how did I work the magic? Buckle your seat belts, it’s going to get bumpy…

Variables

Here are all my variables for the WAN Report, bunched together for compactness. Comment lines at the beginning provide variable name and a hint about the type.

# Region
# type Custom, multi-value
AP,EU,NA,SA

# RegionalNowOffset
# type: hidden
import "array"
import "dict"
regionOffsetDict = ["AP":11,"NA":-4,"EU":2,"SA":-3]
offset_dur = dict.get(dict:regionOffsetDict, key:"${Region}", default:11) // if multi-value, assume worst case
prefix = if offset_dur > 0 then "%2B" else ""
offsetString = prefix + string(v: offset_dur) + "h"
arr = [{valueString: offsetString}]
array.from(rows: arr)

# bucket
# type: custom, hidden

# threshold
# type: custom
0,20,40,60,80,90,95

# math
# type: custom
n95,avg,max

# math_days
# type: custom
1,2,7

# GMT_offset
# type: query, multi-value
from(bucket: "${bucket}")
|> range (start: -1d)
|> filter(fn:(r) => r._measurement == "vedge_stat")
//|> filter(fn:(r) => r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.region, set: ${Region:json}))
|> group()
|> distinct(column: "GMT_offset")

# country
# type: query, multi-value
from(bucket: "${bucket}")
|> range (start: -1d)
|> filter(fn:(r) => r._measurement == "vedge_stat")
//|> filter(fn:(r) => r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.region, set: ${Region:json}))
|> group()
|> distinct(column: "country")

# SSID
# type: query, multi-value
from(bucket: "${bucket}")
|> range (start: -1d)
|> filter(fn:(r) => r._measurement == "vedge_stat")
//|> filter(fn:(r) => r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.region, set: ${Region:json}))
|> filter(fn:(r) => contains(value: r.country, set: ${country:json}))
|> group()
|> distinct(column: "SSID")

# item
# type: query, multi-value
from(bucket: "${bucket}")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stat")
//|> filter(fn:(r) => r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.region, set: ${Region:json}))
|> filter(fn:(r) => contains(value: r.country, set: ${country:json}))
|> filter(fn:(r) => contains(value: r.SSID, set: ${SSID:json}))
|> filter(fn:(r) => contains(value: r.GMT_offset, set: ${GMT_offset:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold})
|> group()
|> distinct(column: "item")

And that’s the easy part!

Link to detailed traffic graph

This is a text panel, actually. Here’s the text:

<br>
<h2>
<a target="details-multiple-ifaces" href="/d/8V9ka4k/multiple-vedges-graph?orgId=1&${Region:queryparam}&${threshold:queryparam}&${math:queryparam}&${math_days:queryparam}&${country:queryparam}&${SSID:queryparam}&from=-5d&to=now${RegionalNowOffset}">
Detailed traffic graph for matching interfaces</a>
</h2>
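The to=now${RegionalNowOffset} part of that link is easier to see in Python. This sketch reproduces the logic of the RegionalNowOffset variable: look up the region, fall back to the largest offset when the lookup fails (for example with a multi-value selection), and URL-encode the plus sign as %2B. The function name is hypothetical; the offsets are the ones from the variable definition.

```python
REGION_OFFSET = {"AP": 11, "NA": -4, "EU": 2, "SA": -3}

def regional_now_offset(region):
    """Build the suffix appended to 'now' in the link's to= parameter."""
    offset = REGION_OFFSET.get(region, 11)        # unknown or multi-value: worst case
    prefix = "%2B" if offset > 0 else ""          # '+' must be URL-encoded
    return prefix + str(offset) + "h"

for region in ("AP", "NA", "{NA,EU}"):
    print(region, "to=now" + regional_now_offset(region))
```

So for AP the graph's time range ends at now+11h, which is how the latest data from a region that is "ahead" of UTC becomes visible.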

The vedge_…iface_direction column query

I’ve previously discussed how I did the headers, so let’s talk about the queries which make up the columns. The leftmost column names the interface + direction. It uses a stat panel – remember I like the text output of stat!

import "array" //we rely on the template variable item to populate our string
import "strings"
region = strings.substring(v: "${item}", start:0, end:2) // returns, e.g., NA
arr = [{valueString: "${item}", region:region}]
array.from(rows: arr)

No big deal, right? Maybe the hardest thing was getting the link to work. The link takes you to the heatmap dashboard for that specific interface.

But just to mention it, the stat features: repeat by the variable item, repeat direction vertical, max no. of rows to display is 1, All Values, Fields valueString, Text mode Value, color mode Background solid, text size 14.

And here is the data link, suitably obfuscated:

/d/fads-f2c-418203d3/single-vedge-heat-map-v2-1?orgId=1&&var-days3=All&var-Region=${__data.fields.region}&${threshold:queryparam}&${math:queryparam}&${math_days:queryparam}&${item:queryparam}

So see what I did there? Compared to the earlier version I added the region to the time series so that as a variable it would be available to me to refer to in the data link. That was the only way I saw to do that.

Finding possible variables

I’ve mentioned this elsewhere, but to repeat since it’s a very helpful tip, just type $ and usually it will show you all the possible variable completions. Pick the one you want.

The mean, n95, max column

This is also a one unit height stat, of course, taking three values.

import "math" // v 2.15 -DrJ 2023.07.21
import "array"
import "regexp"
import "strings"
import "date"
import "dict"
import "join"
CC = strings.substring(v: "${item}", start:2, end:4) // returns, e.g., US
region = strings.substring(v: "${item}", start:0, end:2) // returns, e.g., NA
regionOffsetDict = ["AP":8h,"NA":-4h,"EU":0h,"SA":-3h]
offset_dur = dict.get(dict:regionOffsetDict, key:region, default:0h)
startRegion = date.add(d: offset_dur, to: v.timeRangeStart)
startTrunc = date.truncate(t: startRegion, unit: 1d)
stopRegion = date.add(d: offset_dur, to: v.timeRangeStop)
stopTrunc = date.truncate(t: stopRegion, unit: 1d)
startDayInt = date.yearDay(t: startTrunc)
stopDayInt = date.yearDay(t: stopTrunc)
timePickerDays = stopDayInt - startDayInt
data = from(bucket: "${bucket}")
  |> range(start:startTrunc, stop: stopTrunc)
  |> filter(fn: (r) => r._measurement == "vedge")
  |> filter(fn: (r) => r.item == "${item}") // this automatically takes care of the region match
  |> filter(fn: (r) =>  r._field == "percent" and 
    (r.UTChour == "08" or r.UTChour == "09" or r.UTChour == "10" or r.UTChour == "11" 
    or r.UTChour == "12" or r.UTChour == "13" or r.UTChour == "14" or r.UTChour == "15"
    or r.UTChour == "16" or r.UTChour == "17" or r.UTChour == "18")
  )
  |> map(fn: (r) => ({r with dayNumber: date.weekDay(t: r._time) })) // get day of week
  |> map(fn: (r) => ({r with day: date.yearDay(t: r._time) })) // get day of year
  |> map(fn: (r) => ({r with day: string(v: r.day)})) // convert day to string cf. day tag in  holidays measurement
  |> map(fn: (r) => ({r with workDay: if r.dayNumber == 0 then false else if r.dayNumber == 6 then false else true  }))
  |> filter(fn: (r) => r.workDay == true or timePickerDays == 1) // just consider work days, i.e., Mon - Fri unless today is Monday
  |> keep(columns:["day","_value"])
holidays = from(bucket: "${bucket}") // extract all the holidays for this country
  |> range(start:-58d)
  |> filter(fn: (r) =>
    r._measurement == "holidays" and r.CC == CC
    )
  |> last() // to only spit out the most recent run
  |> group(columns:["year","CC"])
  |> keep(columns:["day","_value"])
myjoin = join.left(   // join iface data with holiday data
    left: data,
    right: holidays,
    on: (l, r) => l.day == r.day,
    as: (l, r) => ({_value: l._value, holiday_flag:r._value})
  )
dataNoHolidays = myjoin // only take data where there was no holiday OR time period == 1 day
 |> filter(fn: (r) => not exists r.holiday_flag or timePickerDays == 1)
 |> keep(columns:["_value"])
meanTbl = dataNoHolidays |> mean()
maxTbl = dataNoHolidays
  |> max()
  |> toFloat()
n95Tbl = dataNoHolidays |> quantile(q: 0.95)
3values = union(tables: [meanTbl,n95Tbl,maxTbl])
 |> map(fn: (r) => ({r with _value: math.trunc(x: r._value)}))
 |> map(fn: (r) => ({r with valueString: string(v: r._value)+"%"}))
 |> map(fn: (r) => ({r with region: region})) // just needed for the link
3values

So it’s similar to what I had before, but I had to re-work the region matching. But since I repeat by item, I have the luxury to know I am in a specific region for this particular interface and I use that information.

The relatively hard stuff about the left outer join remains the same as explained in the early July section. I also added the same data link that the left-most column has, just to make things easy for the user.
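
If it helps to see the logic of that query outside of Flux, here is a minimal Python sketch of the same filtering and the three statistics. The function name, the argument shapes and the nearest-rank percentile are my assumptions for illustration. Flux's quantile() uses a t-digest estimate by default, so its 95th percentile can differ slightly from this one.

```python
import math
from datetime import datetime


def summarize(samples, holiday_days, single_day=False):
    """samples: list of (UTC datetime, percent); holiday_days: set of day-of-year ints."""
    kept = []
    for ts, pct in samples:
        if not 8 <= ts.hour <= 18:               # same hours as the UTChour filter
            continue
        workday = ts.weekday() < 5               # Monday..Friday
        holiday = ts.timetuple().tm_yday in holiday_days
        # a one-day time range bypasses both the weekend and the holiday filter
        if single_day or (workday and not holiday):
            kept.append(pct)
    if not kept:
        return None
    kept.sort()
    rank = max(1, math.ceil(0.95 * len(kept)))   # nearest-rank 95th percentile
    return {
        "mean": math.trunc(sum(kept) / len(kept)),
        "p95": math.trunc(kept[rank - 1]),
        "max": math.trunc(kept[-1]),
    }
```

The single_day flag plays the role of timePickerDays == 1 in the Flux version.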

New column: allocated bandwidth

data = from(bucket: "${bucket}")
  |> range(start: -2d, stop: now()) // take last two days in case one run had an error
  |> filter(fn: (r) =>
    r._measurement == "vedge_stat" and
    r._field == "speed" and r.item == "${item}"
  ) 
  |> last() // but now we just keep the last values
  |> keep(columns: ["_value"])
  |> map(fn: (r) => ({r with _value: r._value * 0.000001})) // bps to mbps, roughly
  |> map(fn: (r) => ({r with valueString: string(v: r._value ) + " mbps"})) // create a string field
data

Not too difficult once you know the basics, right? It takes advantage of having the speed column in the vedge_stat measurement.

Multi vedge graph based on interfaces, not items

Now where things really get squirrelly is the multi vedge graph dashboard that is linked to the WAN Report dashboard. Whereas the WAN report lists each item separately and thus ingress and egress are shown in their own separate rows for a given interface, I’ve been asked to combine ingress and egress for the graph view. Plus show the allocated bandwidth for good measure! How to do all that??

Well, I found a way.

I created a new template variable iface. The other template variables are pretty much the same. iface is defined as follows.

import "regexp"
from(bucket: "${bucket}")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stat")
//|> filter(fn:(r) => r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.region, set: ${Region:json}))
|> filter(fn:(r) => contains(value: r.country, set: ${country:json}))
|> filter(fn:(r) => contains(value: r.SSID, set: ${SSID:json}))
|> filter(fn:(r) => contains(value: r.GMT_offset, set: ${GMT_offset:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold})
// next line will remove the _ingress or _egress at the end of the item name
|> map(fn: (r) => ({r with iface: regexp.replaceAllString(r: /_[ine]{1,2}gress/, v: r.item, t:"")}))
|> group()
|> distinct(column: "iface") // just show distinct interface names

So I took advantage of the common elements in item that I wanted to group together, namely, everything other than ingress / egress, which appear last in the item name. So they are combined via RegEx manipulations. But that’s not the last of my tricks…
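
The RegEx trick is easy to try out on its own. Here is a small Python sketch using the same pattern as the Flux query. The item names in the usage note are made up for illustration; only the pattern comes from the query above.

```python
import re

# same pattern as in the Flux query: matches _ingress or _egress
SUFFIX = re.compile(r"_[ine]{1,2}gress")


def iface_of(item):
    """Strip the direction suffix so ingress and egress share one interface name."""
    return SUFFIX.sub("", item)


def distinct_ifaces(items):
    """Mimic group() |> distinct(column: "iface")."""
    return sorted({iface_of(item) for item in items})
```

For example, distinct_ifaces(["siteA_ge0/0_ingress", "siteA_ge0/0_egress"]) collapses to the single interface siteA_ge0/0.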

Time Series

So I use the time series display, of course. The main Flux query which does the magic is this one:

import "regexp"
iface2 = regexp.quoteMeta(v: "${iface}") // this escapes our . and / characters
ifaceRegex = regexp.compile(v: iface2) // item will be matched against this iface regex
data = from(bucket: "${bucket}")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "value" and r.item =~ ifaceRegex
    )
  |> keep(columns: ["_time","_value","item"])
speeddata = from(bucket: "${bucket}") //get the speed data
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) =>
    r._measurement == "vedge_stat" and
    r._field == "speed" and r.item =~ ifaceRegex
  )
  |> map(fn: (r) => ({r with item: string(v: "Allocated bandwidth")}))
  |> keep(columns: ["_time","_value","item"])
  |> last()
3values = union(tables: [data,speeddata])
3values

Two items will RegEx match against one iface and produce two time series, assuming we keep the right columns. The visualization is repeated by variable iface. I treat the speed like a third time series even though it really is a single number. And I arrange for it to have a nice name: Allocated bandwidth.
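
To see why the quoteMeta step matters, here is a rough Python equivalent, where re.escape stands in for regexp.quoteMeta. The interface and item names are hypothetical.

```python
import re


def items_for_iface(iface, items):
    """Return the items whose name contains the literal interface name."""
    pattern = re.compile(re.escape(iface))   # escape . and / so they match literally
    return [item for item in items if pattern.search(item)]
```

Without the escaping, the dot in an interface name like r1_ge0/0.1 would match any character. Note that this is an unanchored match, so an interface whose name is a prefix of another one would pick up that one's items as well. Whether that can happen depends on your naming scheme.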

Why treat allocated bandwidth (speed) as a time series? By doing this I could be sure it always gets drawn in the graph, since the y-axis has to stretch to include it. This solved the problem I had until now wherein the nice red dashed line, which was the threshold, wasn’t visible on the time series graph if the interface usage data was well below that value, which it often is. Now the red dashed threshold line is always drawn in every graph.

Well, that’s query A. Then there’s query B:

import "regexp"
iface2 = regexp.quoteMeta(v: "${iface}")
ifaceRegex = regexp.compile(v: iface2) // item will be matched against this iface regex
data = from(bucket: "${bucket}")
  |> range(start: -2d, stop: now())
  |> filter(fn: (r) =>
    r._measurement == "vedge_stat" and
    r._field == "speed" and r.item =~ ifaceRegex
  )
  |> last()
  |> drop(columns: ["item","category","region","GMT_offset","country","SSID"])
data

Query B is to get the speed. Query B is used in a Transform:

This transform helps us draw a red dashed line for the speed

So this transform dynamically sets a threshold and we show Thresholds as lines (dashed).

Wait. But there’s more! Remember this speed is also a “series” based on query A. But it’s really a fake series. But I cleverly arranged for this fake series to have the same color as my red dashed line showing the threshold via an override!

Override to give Allocated bandwidth “series” same color as threshold

In addition I managed to provide a link to the single item heatmap, which also required yet another trick. That link is:

/d/fdabsfdb-hbs-9jnd-7d3/single-vedge-heat-map-v2-1?orgId=1&${Region:queryparam}&${threshold:queryparam}&${math:queryparam}&${math_days:queryparam}&var-item=${__field.labels.item}
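
For anyone wondering what that link expands to, the sketch below builds the same kind of URL in Python. It assumes the queryparam format turns each selected value into its own var-name=value pair. The function name and the sample values are made up.

```python
from urllib.parse import urlencode


def heatmap_link(base, variables, item):
    """variables: dict of template variable name -> list of selected values."""
    pairs = [("orgId", "1")]
    for name, values in variables.items():
        for value in values:                 # multi-value variables repeat the key
            pairs.append((f"var-{name}", value))
    pairs.append(("var-item", item))         # comes from the clicked series label
    return base + "?" + urlencode(pairs)
```

The last pair corresponds to ${__field.labels.item}, which Grafana fills in from the series you click on.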

Site Bandwidth, Aug 2023

This is another awesome visualization I developed which rehashes much of what you’ve seen above, but introduces a brand new feature: the ability to display different time series depending on a drop-down list!

Dashboard with SUMMARY INFO
Dashboard with DETAILS

Here are the nitty gritty details.

Here are all the variable definitions, bunched together with a comment telling you the variable name.

// Region - custom, multi-value variable
AP,EU,NA,SA
// bucket - custom, hidden variable
UC03
// SSID - query, multi-value variable
import "strings"
from(bucket: "${bucket}")
|> range (start: -12h)
|> filter(fn:(r) => r._measurement == "SSID_bw")
|> map(fn: (r) => ({r with region: strings.substring(v: r.SSID, start: 0, end: 2)}))
|> filter(fn:(r) => contains(value: r.region, set: ${Region:json}))
|> group()
|> distinct(column: "SSID")
// ltype - query
import "strings"
from(bucket: "${bucket}")
|> range (start: -12h)
|> filter(fn:(r) => r._measurement == "SSID_bw")
|> map(fn: (r) => ({r with region: strings.substring(v: r.SSID, start: 0, end: 2)}))
|> filter(fn:(r) => contains(value: r.region, set: ${Region:json}))
|> group()
|> distinct(column: "SSID")
// bucket_uc02 - custom, hidden
UC02
// summary - custom
SUMMARY INFO,DETAILS

Awesomeness – query A

Most of the awesomeness of this dashboard is in query A:

import "strings"
import "array"
empty = array.from( rows: [{}]) // an empty table which we will use later on
dataDetails = from(bucket: "${bucket}")   // this will show all time series
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn:(r) => r._measurement == "SSID_bw")
  |> filter(fn:(r) => r.SSID == "${SSID}")
  |> filter(fn:(r) => r.ltype == "${ltype}")
  |> filter(fn:(r) => r._field == "available_bw_mbps" or r._field == "capacity")
  |> drop(columns:["ltype","SSID"])
dataCapacity = from(bucket: "${bucket}") // this averages the capacity time series
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn:(r) => r._measurement == "SSID_bw")
  |> filter(fn:(r) => r.SSID == "${SSID}")
  |> filter(fn:(r) => r.ltype == "${ltype}")
  |> filter(fn:(r) => r._field == "capacity")
  |> map(fn: (r) => ({r with "calculated_capacity": r._value}))
  |> keep(columns:["calculated_capacity","_time"])
  |> aggregateWindow(column: "calculated_capacity", every: 5m, fn: mean)
  |> keep(columns:["calculated_capacity","_time"])
dataBW = from(bucket: "${bucket}")  // to get the average of the available bw data
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn:(r) => r._measurement == "SSID_bw")
  |> filter(fn:(r) => r.SSID == "${SSID}")
  |> filter(fn:(r) => r.ltype == "${ltype}")
  |> filter(fn:(r) => r._field == "available_bw_mbps")
  |> map(fn: (r) => ({r with "average_available_bw": r._value}))
  |> keep(columns:["average_available_bw","_time"])
  |> aggregateWindow(column: "average_available_bw", every: 5m, fn: mean)
  |> keep(columns:["average_available_bw","_time"])
speeddata = from(bucket: "${bucket_uc02}") //get the speed data
  |> range(start: -1d, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "vedge_stat")
  |> filter(fn: (r) => r._field == "speed" 
    and r.SSID == "${SSID}" and strings.containsStr(v: r.item, substr: "${ltype}")
  )
  |> map(fn: (r) => ({r with "true capacity": r._value / 1000000.0})) // convert bps to mbps
  |> map(fn: (r) => ({r with _time: v.timeRangeStop}))
  |> keep(columns: ["true capacity","_time"])
  |> last(column:"true capacity")
if "${summary}" == "SUMMARY INFO" then dataBW else dataDetails
if "${summary}" =="SUMMARY INFO" then dataCapacity |> yield(name: "2") else empty |> yield(name: "empty")
speeddata |> yield(name: "speeddata")

Techniques developed for this dashboard

Time Series displayed depends on drop-down selection

Using conditional logic on the variable summary, we display either summary info, i.e., the average of two series together, or the details, i.e., each available bandwidth and capacity.

Averaging two different time series

I average two time series together for outputting the summary info.

Multiple time series (tables) shown resulting from semi-complex selection criteria for each

Using the yield function permits me to output a second (or third, fourth, etc) time series after I have outputted the first one.

Conditional logic to determine output of an entire table

As far as I can tell it’s not documented, but it seems to work to use if…then…else to output a table (as opposed to setting a variable’s value). But as you are required to have an else clause (for some reason), I needed to purposefully create an empty table!

This probably indicates I did things in a less-than-optimally-efficient manner, but oh well. It’s coding that does the job.
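
The selection logic boils down to the following, sketched here in Python with plain lists standing in for tables. The function name is mine. Python's conditional expression also insists on an else branch, which is why an explicit empty value is needed here too.

```python
EMPTY = []   # stands in for the empty table built with array.from


def tables_to_show(summary, data_bw, data_capacity, data_details, speeddata):
    """Return the three outputs the query yields, in order."""
    first = data_bw if summary == "SUMMARY INFO" else data_details
    second = data_capacity if summary == "SUMMARY INFO" else EMPTY
    return [first, second, speeddata]
```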

Empty table

Using my favorite array.from, I show how to quickly and compactly create an empty table.

Others already described

Query B used to determine the threshold based on interface speed data – that’s already been described in my WAN Report dashboard.

Inefficiency

You may realize I’m calculating time series which may never get displayed, which is true. But it isn’t much data and there isn’t any noticeable penalty so having all my time series at the ready is useful because when a user does change from SUMMARY INFO to DETAILS, the change in the visualization is instant, and hence provides a great user experience.

So, yes, I could have developed two different dashboards, but I like this approach better.

How to use a time series visualization with legend

I’m sure everyone else knows this but until now I never worked with multiple time series, I never displayed a legend, etc, so I didn’t know until I discovered by accident that you click on the series name in the legend and it makes the other series disappear!

And to bring them back hold the CTRL key down before clicking.

To bring up the link is kind of tricky. Your mouse pointer has to be really close to a data point I would say before it works.

Vertical alignment of multiple time series visualizations

This is technically not possible! So you may get kind of close, but not precisely. I was trying to line up things with different units when I noticed they didn’t align! Devices was in one visualization and latency as measured in ms in another. What I did was to add a label to my Devices plot (simply called Devices) to push its start time to the right which made it better aligned (but not perfectly) with the latency visualization above it.

Usage of Zabbix data

There is direct support for the Zabbix api. I use that. But my items from Zabbix provided the bandwidth used by an IPSEC tunnel, but not the percentage used. How to pull the capacity of the Internet circuit from another item and do some simple arithmetic? As with many things in Grafana, this seems to be much more difficult than it should be.

All I could think of is to pull the capacity from the Zabbix item into a template variable. Then, in yet another template variable, take its reciprocal and multiply by 100! Then finally in the Grafana time series visualization, scale the IPSEC bw used by this reciprocal/percentage variable to come up with the percentage of capacity being used. A bit ugly, but it did work for me. The first variable is speed. The next one I call reciprocalSpeed:

import "array"
speedf = float(v: ${speed})
recip = 100. / speedf // but make it a percentage 
arr = [{valueString: recip}]
array.from(rows: arr)

The reason I felt forced to go this route is that once you choose Zabbix as your source, there is very little manipulation you get to do. Scale is one of the few things available to you.
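
The arithmetic behind the two variables is simple enough to check by hand. Here is a tiny Python sketch of it. The function names are mine, and it assumes speed and the bandwidth used are in the same unit.

```python
def reciprocal_percent(speed):
    """What the reciprocalSpeed variable holds: 100 / capacity."""
    return 100.0 / float(speed)


def percent_used(bw_used, speed):
    """What the Scale step computes: usage multiplied by the reciprocal."""
    return bw_used * reciprocal_percent(speed)
```

So with a capacity of 100,000,000 and a usage of 25,000,000 the scaled value comes out at about 25, i.e., 25 percent.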

Executive View for China example

I don’t know how much time I have to go into the details. But a very important lesson or two is to be learned by understanding what I did in creating this dashboard.

And it’s totally awesome if I do say so. I want to focus on the section labelled Issues, where I learned a very helpful technique.

You see that section where it shows Good Good Good? That is the network health scoring for one particular site, Beijing in this case. It comes about by doing some math with the flux expression. A health number from 0 – 100 is calculated, then Value Mappings are used in the Stat visualization to convert the numbers into one of three words, Good, Degraded or Bad, and the word is colored green for Good, orange for Degraded and red for Bad. The result for a different site is displayed by choosing it in the Site drop-down list.

But the Issues section, which was the last section to be added, is dependent on those dynamic calculations which are done on each site. The target audience for this dashboard said it’s great and all, but can we provide in one screen a list of all sites currently experiencing degraded or bad network performance?? A reasonable enough ask, but how to do it in Grafana?? I was ready to give up and resign myself to pre-calculating all those health scores for all sites on a regular basis, and put that into a pipeline job which would have spit out the results to a new InfluxDB measurement. That would have been a lot of work, honestly. But some youngster (under 30) looked at my existing code and had a flux language fix for my dilemma in a working demo within a couple hours! Then I further refined his demo to make it look even better.

I’d say the main thing is to use a pivot on the data. With the pivot it can spit out only those sites with either degraded or bad health, put that into a hidden template variable which in turn can be used as an iterator to make my nice stat visualizations, similar to what I’ve done previously and shown in detail above.

Just to show the complexity of this template variable, here is how it is calculated. In the end I see I did not need to use his pivot although it was a good technique to master.

import "date" // a superior mind figured out the hard stuff in this query
import "join"
startBaseline = date.add(d: -${baselineDelta}h, to: v.timeRangeStop) // eventually make this v.timeRangeStart !!!
stopBaseline = date.add(d: 24h, to: startBaseline)
startDelayed = date.add(d: -${timeRecent}s, to: v.timeRangeStart)
stopDelta = date.add(d: -${timeRecent}s, to: v.timeRangeStop)
abaseresults = from(bucket: "${bucket}")  // to get the average latency ascore. Note the site filter was removed
  |> range(start: startBaseline, stop: stopBaseline)
  |> filter(fn:(r) => r._measurement == "Agents")
  |> filter(fn:(r) => r._field == "latency")
  |> group(columns: ["SSID","ltype"])  
  |> keep(columns:["_value", "SSID","ltype"])
  |> mean() // produces a single number
alastresults = from(bucket: "${bucket}")  // to get the average latency ascore
  |> range(start: stopDelta, stop: v.timeRangeStop)
  |> filter(fn:(r) => r._measurement == "Agents")
  |> filter(fn:(r) => r._field == "latency")
  |> group(columns: ["SSID","ltype"])    
  |> keep(columns:["_value", "SSID","ltype"])
  |> mean() // produces a single number
bbaseresults = from(bucket: "${bucket}")  // to get the average packet loss for the baseline, used in bscore
  |> range(start: startBaseline, stop: stopBaseline)
  |> filter(fn:(r) => r._measurement == "Agents")
  |> filter(fn:(r) => r._field == "packet_loss")
  |> group(columns: ["SSID","ltype"])  
  |> keep(columns:["_value", "SSID","ltype"])
  |> mean() // produces a single number
blastresults = from(bucket: "${bucket}")  // to get the average recent packet loss, used in bscore
  |> range(start: stopDelta, stop: v.timeRangeStop)
  |> filter(fn:(r) => r._measurement == "Agents")
  |> filter(fn:(r) => r._field == "packet_loss")
  |> group(columns: ["SSID","ltype"])  
  |> keep(columns:["_value", "SSID","ltype"])
  |> mean() // produces a single number
myjoin = join.left( // join abaseresults and alastresults into one stream. This is super difficult way to do things!
  left: abaseresults,
  right: alastresults,
  on: (l,r) => l.SSID == r.SSID and l.ltype == r.ltype,
  as: (l, r) => ({abase: l._value, alast: r._value, SSID: l.SSID, ltype: l.ltype  })
)
aresults = myjoin
 |> map(fn: (r) => ({r with ascore: if r.abase > ${aCutoff} then 100.0*(1.0 - ${aHighCoeff}*(r.alast - r.abase)/r.abase) else 
    100.0*(1.0 - ${aLowCoeff}*(r.alast - r.abase)/r.abase) }))
myjoinb = join.left( // join bbaseresults and blastresults into one stream
  left: bbaseresults,
  right: blastresults,
  on: (l,r) => l.SSID == r.SSID and l.ltype == r.ltype,
  as: (l, r) => ({bbase: l._value, blast: r._value,  SSID: l.SSID ,ltype:l.ltype })
)
bresults = myjoinb
 |> map(fn: (r) => ({r with bscore: if r.blast < ${bCutoff} then 100.0 - ${bSimpleCoeff}*r.blast else 100.0*(1.0 - ${bLowCoeff}*(r.blast*${bHighCoeff} - r.bbase)/(r.bbase + 0.001)) })) // avoid / 0.0 !
ares2 = aresults
  |> keep(columns:["ascore","SSID","ltype"])
bres2 = bresults
  |> keep(columns:["bscore","SSID","ltype"])
myjoinc = join.left( // join ares2 and bres2 into one stream. This is super difficult way to do things!
  left: ares2,
  right: bres2,
  on: (l,r) => l.SSID == r.SSID and l.ltype == r.ltype,
  as: (l, r) => ({ascore: l.ascore, bscore: r.bscore, SSID:l.SSID, ltype: l.ltype})
) 
finalresults = myjoinc
 |> map(fn: (r) => ({r with tmpfinal: (r.ascore + r.bscore)/2.0 }))
 |> map(fn: (r) => ({r with finalscore: if r.tmpfinal > 100.0 then 100 else if r.tmpfinal < 0.0 then 0 else int(v: r.tmpfinal) }))
// |> filter(fn:(r) => r.finalscore <= 93) // for debugging
  |> filter(fn:(r) => r.finalscore <= ${deg_thresh})
  |> group()
  |> keep(columns: ["SSID"])
  |> distinct(column: "SSID")
finalresults

Of course this refers to a bunch of pre-existing template variables, but that’s kind of obvious so no need to define all of them in detail.

Just to mention it, the formula compares the current value for latency against the value as determined by a baseline taken as the average for this same day one week ago. Similarly for the packet loss. Those two numbers are combined for a composite score.
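
Stripped of the joins, the scoring formula from the query looks like this in Python. The structure follows the Flux map() calls. The parameter names mirror the template variables, but any values you plug in for the cutoffs and coefficients are up to you, since I haven't listed mine here.

```python
def latency_score(base, last, cutoff, high_coeff, low_coeff):
    """ascore: penalize the relative increase of latency over the baseline."""
    coeff = high_coeff if base > cutoff else low_coeff
    return 100.0 * (1.0 - coeff * (last - base) / base)


def loss_score(base, last, cutoff, simple_coeff, low_coeff, high_coeff):
    """bscore: simple linear penalty for small loss, baseline comparison otherwise."""
    if last < cutoff:
        return 100.0 - simple_coeff * last
    # the 0.001 avoids dividing by zero when the baseline loss is 0
    return 100.0 * (1.0 - low_coeff * (last * high_coeff - base) / (base + 0.001))


def final_score(ascore, bscore):
    """Average the two scores, truncate to an integer and clamp to 0..100."""
    tmp = (ascore + bscore) / 2.0
    return max(0, min(100, int(tmp)))
```

A site lands in the Issues list when final_score is at or below the deg_thresh variable.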

The latency and packet loss values are taken from an InfluxDB measurement which itself is fed by a pipeline job which uses the ThousandEyes api to get those values from our ThousandEyes enterprise agents.

Code re-use? Forget about it! This same formula has to be re-entered in other sections of my dashboard.

A word on RegEx

It should be easy, but I’ve had a hard time with it. Here, lately, is my syntax which is working for me. It’s changed from the examples above (note the regexp.compile):

import "regexp"
RegionRegex = regexp.compile(v: "^" + "${Region}") // just pick up sites from chosen region
from(bucket: "${bucket}")
|> range (start: -2d)
|> filter(fn:(r) => 
 r._measurement == "SSID" and
 r.SSID =~ RegionRegex)
|> keep(columns:["SSID"])

Repeating myself

Switching between regular and Tables view is essential to know what the heck is going on. A larger monitor really helped as well – I never really used one until Grafana more-or-less required it of me! Adding fake extra columns to a table can be really useful and the only way to do certain things, and it’s really not hard once you know how. When you’re stuck, figure a way to reduce your query to its essence and test it within query explorer, perhaps defining a few of your variables by hand at the top (which is time-consuming but sometimes necessary).

How to dump data into a CSV

If you mouse over a visualization and click on the three dots, choose Inspect > Data. You are given the option to save as CSV.

However, for my multiple time series there seems to be a bug: it arbitrarily picks just one of the time series and does not let you select the others. I reproduced this on play.grafana.org. So I formally reported the bug on the GitHub site. They fixed it in v 10.1.2 (release date 9/18/23)! My experience was good, meaning, they did not chew me out for not being an insider with intricate knowledge of OSS development protocol. Furthermore, it works better than I guess it did in the past. I read about limits to the number of data points you could save this way, maybe 500. But I tested with up to 2600 data points and got all of them into a CSV file. So maybe the upper limit is the limit of the number of points you can display on a graph (which I forget)?

Invalid: error @8:6 – 8:13 record is missing label substrings

If I try to use the strings function without importing it I get a pretty obvious error, undefined identifier strings, with the row number and column numbers. But if I did import strings and then used the wrong function, the error is more subtle. I was referring to substrings, not substring, and got the error in the header.

Forget to import join?

Well, then the error might really throw you. I get:

 invalid: error @42:10-42:14: expected { A with left: ( as: (l: {B with _value: C}, r: {D with _value: E}) => {holiday_flag: E, _value: C}, left: stream[F], on: (l: {G with day: H}, r: {I with day: J}) => bool, right: stream[K], ) => C, } (record) but found (<-tables: L, ?method: string, ?on: [string]) => stream[M] (function)

Not tremendously obvious and I almost gave up. Then I saw I had forgotten to do an import "join"! It’s hard to complain when the error reporting is otherwise quite good.

All time series have disappeared except the one you clicked on

This is not well documented. It’s kind of sort of a bug. Say you are displaying multiple time series with a legend. All good in the beginning. Then you click on the legend key to see just one time series by itself. All good. Then you work on the dashboard, save it, etc. and after that, every single time you display it only that one time series displays even though you didn’t specify to save current variables. You can click on the other time series in the legend, but by default they’re all hidden.

This happened to me. I had created a series override. Well, clicking on a key in the legend creates a second override! So once I saved my work I was doomed. Just delete the unwanted override and voila, all will behave as desired once again!

Your stat visualization shows a bunch of numbers in a single square, not one number per square

I lost quite some time on this one. To get those individual little squares, you have to set your color mode to Background Gradient! Really. Try it.

Is your visualization showing too many time series?

After I’d been away from it for a month and forgot everything, I found I was getting more time series with my latest query than I wanted to display. The measurement had multiple fields, which is the key takeaway here. Fields are not columns and cannot simply be dropped. And I didn’t have a filter limiting the fields. To show only the fields you want you need to match your field names explicitly in your filter function. For my latest that looks like this:

 |> filter(fn:(r) => r._field == "available_bw_mbps" or r._field == "capacity")

Your key:value template variables just output a string as key:value

Though it is counterintuitive, to get key:value variables to work you need to put a space before and after the colon! Go figure. E.g., China : APCN, EU : EU

Flux: learn to test units

You know how I created intermediate tables along the way? Any of them could have errors. So a way to do unit testing is to comment out the end table – 3values in this case – and put one of the intermediate tables such as data, or holidays, or myjoin or meanTbl. You basically better test them all to be really sure. And just use the Tables view to look at it.

And learn to hit Refresh frequently as you enter your Flux language query.

And use Data Explorer

I also used the Data Explorer to test things in isolation. But then I had to add some lines on top to assign values to missing template variables. But I found this approach invaluable when I was developing new features, so practice getting good at it.
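
An alternative to editing the query by hand each time is to let a few lines of Python fill in the template variables before you paste the query into Data Explorer. This is just a sketch of that idea, not something Grafana or InfluxDB provides. It only handles the plain ${name} form and the ${name:json} form used in my queries, and it leaves anything it doesn't recognize untouched.

```python
import json
import re

PLACEHOLDER = re.compile(r"\$\{(\w+)(?::(\w+))?\}")


def fill_variables(query, values):
    """Replace ${name} and ${name:json} with hand-picked test values."""
    def repl(match):
        name, fmt = match.group(1), match.group(2)
        if name not in values:
            return match.group(0)            # unknown variable: leave as is
        value = values[name]
        if fmt == "json":                    # multi-value variables become a JSON array
            return json.dumps(value if isinstance(value, list) else [value])
        return str(value)
    return PLACEHOLDER.sub(repl, query)
```

Call it with something like fill_variables(query_text, {"bucket": "UC03", "Region": ["AP", "EU"]}) and paste the result.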

Time series visualization represents an approximation

I was wondering why my data recorded at 15:05:05 had one value, and the time series visualization showed a completely different value for 15:05. It can only be that time series aggregates the data into time windows. So the point plotted at 15:05 (five-minute time window) was actually taken from the data it received at 15:00:04! And if you export the data from the visualization you get that same approximate, or more properly aggregated, data, not the actual data points.
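
Here is a small Python sketch of what I believe is going on. It assumes five-minute windows that are labelled with their end time, which is how Flux's aggregateWindow stamps its output by default. The function names are mine.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def window_label(ts, every=timedelta(minutes=5)):
    """Return the end time of the [start, stop) window that contains ts."""
    n = (ts - EPOCH) // every
    return EPOCH + (n + 1) * every


def aggregate_mean(points, every=timedelta(minutes=5)):
    """Average the values that fall into each window, keyed by the window's end time."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(window_label(ts, every), []).append(value)
    return {label: sum(vals) / len(vals) for label, vals in sorted(buckets.items())}
```

With this labelling the sample taken at 15:00:04 shows up at 15:05, and the one taken at 15:05:05 only shows up at 15:10.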

Multiple y axes

Yes, you can do it. Use overrides on one of the variables. Then specify placement -> right and units -> whatever you like.

How to get mouse to display multiple series values, not just one

This can also be done! Select Tooltip mode All. This is normally the setting you want so use this tip often!

How to suppress displaying a time series with no data

This is now available with overrides. Choose Fields with Values > All Nulls > override property: Series: hide in area > click Tooltip, Viz, legend

Similarly if you want to not draw a time series with all zeros just choose Fields with Values all Zeros.

How to suppress the decimal in time series for numbers less than 100

In a time series viz standard options, Decimals is normally set to auto and will display 72.0 when you’re only inputting whole numbers! Change it to 0 and you will see 72.

InfluxDB tips for InfluxDB 3.0 serverless

You can easily extend your measurement design and add additional fields and tags to it even after you’ve started using it, which is cool.

However, once you fill a measurement with any amount of data, and then realize you made a mistake with the data type of one of your fields, it’s too late! As far as I know there’s no fixing it and you just have to start over with a new measurement (it happened to me).

The supported data types of a field are:

  • Integer
  • Unsigned integer
  • Float
  • String
  • Boolean

Timeouts writing to InfluxDB cloud

I kept getting timeouts once a day writing data to Influxdb in the cloud. Examine the retries definition and how it is used in influx_modules.py in the feed_influx section of this post above. I’m not sure why it works, but ever since I threw that in there (a week ago) I haven’t had a timeout error.
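
For readers who don't want to scroll back, the general idea of retrying a write looks like the sketch below. This is not the code from influx_modules.py. It is a generic wrapper with made-up names, and it assumes the write function raises TimeoutError, which may not be the exception your client library actually raises.

```python
import time


def write_with_retries(write, payload, retries=3, backoff=1.0, sleep=time.sleep):
    """Call write(payload); on a timeout wait and try again, up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return write(payload)
        except TimeoutError:
            if attempt == retries:
                raise                        # out of retries: let the caller see it
            sleep(backoff * 2 ** attempt)    # wait 1 s, 2 s, 4 s, ...
```

The sleep argument is only there so the waiting can be swapped out when testing.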

Complaints

I am not comfortable with the flux query documentation. Nor the Grafana documentation for that matter. They both give you a taste of what you need, without many details or examples. For instance it looks like there are multiple syntaxes available to you using flux. I just pragmatically developed what works.

Conclusion

Well, this started as amateur hour for InfluxDB and Grafana. But even amateurs can produce decent results if the tools are adequate. And that’s the case here. The results I am producing were “good enough” for our purposes – internal usage – to begin with, and they’ve only gotten better as I honed my skills.

I am content if no one reads this and it only serves as my own documentation. But perhaps it will help someone facing similar issues. Unfortunately one of the challenges is asking a good question to a search engine when you’re a newbie and haven’t mastered the concepts.

But without too too much effort I was able to master enough Grafana to create something that will probably be useful to our vendor management team. Grafana is fun and powerful. There is a slight lack of examples and the documentation is a wee bit sparse. InfluxDB seems to be a good back-end database for Grafana to use. The flux query language, while still obscure to me, was powerful enough for me to get my basic goals accomplished.

References and related

I developed a blurring program in python which I used to present most of these images.

My favorite Flux tips

Lots and lots of examples are provided at https://play.grafana.org, and the best thing is that you can inspect and change stuff. Not sure how they do that…

InfluxDB Cloud serverless docs

Categories
Network Technologies

How to force snmpwalk to convert strings to numeric OIDs

Intro

It’s a little hard to find this information on the Internet, so I’m amplifying the correct answer here by using my blog.

The details

I’m not super-competent with MIBs and such, but I manage for my purposes with my basic understanding. I have access to an F5 bigip with various IPSEC tunnels on it. I want to use Zabbix to check the status of those tunnels. So I do an snmpwalk like this:

snmpwalk -v3 … -c public 127.0.0.1 F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTunnelState

which produces output like this line:

F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTunnelState."/Common/tunnel-01".58401 = STRING: up

But I cannot take that as it is and use it in an snmpget like this:

snmpget -v3 … -c public 127.0.0.1 F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTunnelState."/Common/tunnel-01".58401

That produces an error like this:

Unknown Object Identifier (Index out of range: /Common/tunnel-01 (sysIpsecSpdStatTrafficSelectorName))

So we need to convert the string into a numeric OID. But how?

The answer

Use the -On switch as an additional argument in your snmpwalk.

You will get a scary long OID, but it will at least be numeric.
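
Why so long? My understanding is that a string index gets written out as its length followed by the ASCII code of every character. The Python sketch below does that conversion by hand. It assumes a variable-length string index without the IMPLIED keyword, which I have not verified against the F5 MIB, and the function names are mine.

```python
def encode_string_index(text):
    """Length-prefixed ASCII codes, the usual encoding of a string index."""
    data = text.encode("ascii")
    return [len(data), *data]


def numeric_oid(column_oid, name, *rest):
    """Join the numeric column OID, the encoded string index and any trailing sub-ids."""
    parts = [column_oid.strip(".")]
    parts += [str(n) for n in encode_string_index(name)]
    parts += [str(n) for n in rest]
    return "." + ".".join(parts)
```

Under that assumption "/Common/tunnel-01" turns into 17.47.67.111.109.109.111.110.47.116.117.110.110.101.108.45.48.49, which you can compare against what snmpwalk -On prints.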

Going further

You can then deconstruct the response and reconstitute the section at the beginning with a nice name. For my F5 example

.1.3.6.1.4.1.3375.2.1.2.17.1.3.1.14

becomes

F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatBytes

I think. Then preserve the following digits as is.

Conclusion

We have shown how to output a numeric OID from an snmpwalk. This, specifically, is useful in converting a string embedded in the output into a numeric OID, which may then be used by other SNMP applications such as Zabbix which may or may not have the MIB file loaded. The secret is simply to use the -On switch in snmpwalk.

References and related

My Zabbix FAQ – questions you wish they had answered, can be very helpful

Categories
Admin DNS Firewall Network Technologies TCP/IP

The IT detective agency: named times out tcp queries

Intro

I’ve been reliably running ISC’s BIND server for eons. Recently I had a problem getting my slave servers updated after a change to the primary master. What was going on there?

The details

This was truly a team effort. I saw that the zone file had differing serial numbers on the master versus the slave servers. My attempts to update via an rndc refresh zone were having no effect.

So I tried a zone transfer by hand: dig axfr drjohnstechtalk.com @50.17.188.196

That timed out!

Yet, regular DNS queries went through fine: dig ns drjohnstechtalk.com @50.17.188.196

I thought about it and remembered zone transfers use TCP whereas standard queries use UDP. So I tried a TCP-based simple query: dig +tcp ns drjohnstechtalk.com @50.17.188.196. It timed out!

So of course one suspects the firewall, which is reasonable enough. And when I looked at the firewall I found some funny drops, though I couldn’t line them up exactly with my failed tests. But I’m not a firewall expert; I just muddle through.

The next day someone from the DNS group asked how local queries behaved. Hmm, I had never tried that. So I tried it: dig +tcp ns drjohnstechtalk.com @localhost. That timed out as well! That was a brilliant suggestion, as we could now eliminate the firewall and all that complexity from the equation. Until then I had been trying to do packet traces on two different machines at the same time and line up the results. It wasn’t easy.

The whole issue was very concerning to us because we feared our secondaries would be unable to update their slave zones and ultimately time them out. The result would be devastating.

We have support, fortunately. A company that hearkens from the good old days, with real subject matter experts. But they’re extremely busy. We did not get a suggestion for a couple weeks. But eventually we did. They had seen this once before.

named time to respond to TCP-based queries

The above graph is from a Zabbix monitor showing how long it takes that DNS server to respond to that simple query. 6 s is a time-out. I actually set dig to time out at 2 s, but in wall-clock time it takes 6 s, presumably because dig makes three tries by default.
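If you want to reproduce that kind of timing without dig, here is a rough Python sketch using only the standard library. It builds a bare-bones NS query, prefixes it with the two-byte length that DNS over TCP requires, and measures how long the first bytes of the answer take to arrive. The function names are my own invention for illustration, and the response is not parsed at all.

```python
import socket
import struct
import time


def build_query(name: str, qtype: int = 2, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (qtype 2 = NS, class IN) with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def frame_for_tcp(message: bytes) -> bytes:
    """DNS over TCP prefixes each message with its length as two bytes."""
    return struct.pack("!H", len(message)) + message


def time_tcp_query(server: str, name: str, timeout: float = 2.0):
    """Return seconds until a response starts arriving, or None on timeout or error."""
    start = time.monotonic()
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(frame_for_tcp(build_query(name)))
            if len(sock.recv(2)) < 2:  # no length prefix came back
                return None
    except OSError:  # covers timeouts, resets and refused connections
        return None
    return time.monotonic() - start
```

Something like time_tcp_query("localhost", "drjohnstechtalk.com") would be the equivalent of the localhost test above. A None result corresponds to the timeouts we were seeing.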

The fix

We removed this line from the options block of named.conf:

keep-response-order {any; };

The info from the experts is that most likely that was configured as a workaround to CVE-2019-6477, but that issue has been fixed since 9.15.6.

Conclusion

We encountered the named daemon in a situation where it was unable to respond to TCP-based DNS queries and hence unable to do zone transfers. So although most queries use UDP, this was a serious issue for us and prevented zones from being updated on all authoritative nameservers.

As is the case with so many modern IT problems, the effect was not black or white. Failures were intermittent, and then permanent. A restart fixed this issue (forgot to mention so far!). But we involved an expert to find the root cause and it was the presence of a single configuration line in our named.conf. After removing that all was good.

Categories
Admin JavaScript Network Technologies

Practical Zabbix examples

Intro
I share some Zabbix items I’ve had to create which I find useful.

Low-level discovery to discover IPSEC tunnels on an F5 BigIP

IPSec tunnels are weird insofar as there is one IKE SA but potentially lots of SAs – two for each traffic selector. So if your traffic selector is called proxy-01, some OIDs you’ll see in your SNMP walk will be like …proxy-01.58769, …proxy-01.58770. So to review, do an snmpwalk on the F5 itself. That command is something like

snmpwalk -v3 -l authPriv -u proxyUser -a SHA -A shaAUTHpwd -x AES -X AESpwd -c public 127.0.0.1 SNMPv2-SMI::enterprises >/tmp/snmpwalk-ent

Now…how to translate this into LLD? In my case I have a template since there are several F5s which need this. The template already has discovery rules for Pool discovery, Virtual server discovery, etc. So first thing we do is add a Tunnel discovery rule.

Tunnel Discovery Rule

The SNMP OID is clipped at the end. In full it is:

discovery[{#SNMPVALUE},F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTrafficSelectorName]

Initially I tried something else, but that did not go so well.

Now we want to know the tunnel status (up or down) and the amount of traffic over the tunnel. We create two item prototypes to get those.

Tunnel Status Item prototype

So, yes, we’re doing some fancy regex to simplify the otherwise ungainly name which would be generated, stripping out the useless stuff with a regsub function, which, by the way, is poorly documented. So that’s how we’re going to discover the statuses of the tunnels. In text, the name is:

Tunnel {{#SNMPINDEX}.regsub("\"\/Common\/([^\"]+)\"(.+)",\1\2)} status

and the OID is

F5-BIGIP-SYSTEM-MIB::sysIpsecSpdStatTunnelState.{#SNMPINDEX}
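To see what that regsub does to the name, here is the same pattern tried out in Python. This is only an approximation of Zabbix's behaviour, and the sample index is my assumption of what {#SNMPINDEX} looks like for a traffic selector called proxy-01.

```python
import re

# Same pattern as in the item prototype name, without the extra Zabbix escaping
PATTERN = re.compile(r'"/Common/([^"]+)"(.+)')


def simplify_index(snmp_index: str) -> str:
    """Mimic regsub: return groups 1 and 2 joined, or an empty string if nothing matches."""
    match = PATTERN.search(snmp_index)
    return match.expand(r"\1\2") if match else ""


if __name__ == "__main__":
    print("Tunnel", simplify_index('"/Common/proxy-01".58769'), "status")
```

With that sample index the item would be named Tunnel proxy-01.58769 status, which is a lot friendlier than the raw index.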

And for the traffic, we do this:

Tunnel Traffic Item prototype

I learned how to choose the OID, which is the most critical part, I guess, from a combination of parsing the output of the snmpwalk plus imitation of those other LLD item prototypes, which were written by someone more competent than I.

Now the SNMP value for traffic is bytes, but you see I set units of bps? I can do that because of the preprocessing steps which are

Bytes to traffic rate preprocessing steps

Final tip

For these discovery items what you want to do is to disable Create Enabled and disable Discover. I just run it on the F5s which actually have IPSEC tunnels. Execute now actually works and generates items pretty quickly.

Using the api with a token and security by obscurity

I am taking the approach of pulling the token out of a config file where it has been stored, base85 encoded, because, who uses base85, anyway? I call the following script encode.py:

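import sys
from base64 import b85encode

# token to obscure, taken from the command line
s = sys.argv[1]
s_e = s.encode('utf-8')

# base85-encode the bytes; this is the value that goes into the config file
s85 = b85encode(s_e)
print('s,s_e,s85',s,s_e,s85)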

In my case I pull this encoded token from a config file, but to simplify, let’s say we got it from the command line. This is how that goes, and we use it to create the zapi object which can be used in any subsequent api calls. That is the key.

from base64 import b85decode
import sys

from pyzabbix import ZabbixAPI

url_zabbix = sys.argv[1]
t_e = sys.argv[2] # base85 encoded token

# Login Zabbix API
t_b = t_e.encode('utf-8')
to_b = b85decode(t_b)

token_zabbix = to_b.decode('utf-8')
zapi = ZabbixAPI(url_zabbix)
zapi.login(api_token=token_zabbix)
...

So it’s a few extra lines of code, but the cool thing is that it works. This should be good for version 5.4 and 6.0. Note that if you installed both py-zabbix and pyzabbix, your best bet may be to uninstall both and reinstall just pyzabbix. At least that was my experience going from user/pass to token-based authentication.


Convert DateAndTime SNMP output to human-readable format

Of course this is not very Zabbix-specific, as long as you realize that Zabbix produces the outer skin of the function:

function (value) {
// DrJ 2020-05-04
// see https://support.zabbix.com/browse/ZBXNEXT-3899 for SNMP DateAndTime format
'use strict';
//var str = "07 E4 05 04 0C 32 0F 00 2B 00 00";
var str = value;
// alert("str: " + str);
// read values are hex
var y256 = str.slice(0,2); var y = str.slice(3,5); var m = str.slice(6,8); 
var d = str.slice(9,11); var h = str.slice(12,14); var min = str.slice(15,17);
// convert to decimal
var y256Base10 = +("0x" + y256);
// convert to decimal
var yBase10 = +("0x" + y);
var Year = 256*y256Base10 + yBase10;
//  alert("Year: " + Year);
var mBase10 = +("0x" + m);
var dBase10 = +("0x" + d);
var hBase10 = +("0x" + h);
var minBase10 = +("0x" + min);
var YR = String(Year); var MM = String(mBase10); var DD = String(dBase10);
var HH = String(hBase10);
var MIN = String(minBase10);
// padding
if (mBase10 < 10)  MM = "0" + MM; if (dBase10 < 10) DD = "0" + DD;
if (hBase10 < 10) HH = "0" + HH; if (minBase10 < 10) MIN = "0" + MIN;
var Date = YR + "-" + MM + "-" + DD + " " + HH + ":" + MIN;
return Date;
}

I put that javascript into the preprocessing step of a dependent item, of course.

All my real-life examples do not fill in the last two fields: +/-, UTC offset. So in my case the times must be local times. But consequently I have no idea how a + or – would be represented in HEX! So I just ignored those last fields in the SNMP DateAndTime which otherwise might have been useful.
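For what it is worth, the DateAndTime definition in RFC 2579 says the direction from UTC is sent as the ASCII character itself, so a + would be 2B and a – would be 2D, followed by one octet each for the offset hours and minutes. Here is a small Python sketch that parses all the fields under that reading. The function name is made up, and I have only tried it against the sample string from the comment in the script above.

```python
from datetime import datetime, timedelta, timezone


def parse_date_and_time(hex_string: str) -> datetime:
    """Parse an SNMP DateAndTime value given as space-separated hex octets."""
    octets = bytes.fromhex(hex_string)
    year = int.from_bytes(octets[0:2], "big")
    month, day, hour, minute, second = octets[2:7]
    tz = None  # the 8-octet form carries no UTC offset
    if len(octets) == 11:
        sign = -1 if chr(octets[8]) == "-" else 1
        tz = timezone(sign * timedelta(hours=octets[9], minutes=octets[10]))
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


if __name__ == "__main__":
    print(parse_date_and_time("07 E4 05 04 0C 32 0F 00 2B 00 00"))
```

In the sample string the ninth octet is 2B, i.e. a +, with an offset of zero hours and zero minutes.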

Here’s an alternative version which calculates how long it’s been in hours since the last AV signature update.

// DrJ 2020-05-05
// see https://support.zabbix.com/browse/ZBXNEXT-3899 for SNMP DateAndTime format
'use strict';
//var str = "07 E4 05 04 0C 32 0F 00 2B 00 00";
var Start = new Date();
var str = value;
// alert("str: " + str);
// read values are hex
var y256 = str.slice(0,2); var y = str.slice(3,5); var m = str.slice(6,8); var d = str.slice(9,11); var h = str.slice(12,14); var min = str.slice(15,17);
// convert to decimal
var y256Base10 = +("0x" + y256);
// convert to decimal
var yBase10 = +("0x" + y);
var Year = 256*y256Base10 + yBase10;
//  alert("Year: " + Year);
var mBase10 = +("0x" + m);
var dBase10 = +("0x" + d);
var hBase10 = +("0x" + h);
var minBase10 = +("0x" + min);
var YR = String(Year); var MM = String(mBase10); var DD = String(dBase10);
var HH = String(hBase10);
var MIN = String(minBase10);
var Sigdate = new Date(Year, mBase10 - 1, dBase10,hBase10,minBase10);
//difference in hours
var difference = Math.trunc((Start - Sigdate)/1000/3600);
return difference;

Calculated bandwidth from an interface that only provides byte count
Again in this example the assumption is you have an item, probably from SNMP, that lists the total inbound/outbound byte count of a network interface – hopefully stored as a 64-bit number to avoid frequent rollovers. But the quantity that really excites you is bandwidth, such as megabits per second.

Use a calculated item as in this example for Bluecoat ProxySG:

change(sgProxyInBytesCount)*8/1000000/300

Give it type numeric, Units of mbps. sgProxyInBytesCount is the key for an SNMP monitor that uses OID

IF-MIB::ifHCInOctets.{$INTERFACE_TO_MEASURE}

where {$INTERFACE_TO_MEASURE} is a macro set for each proxy with the SNMP-reported interface number that we want to pull the statistics for.

The 300 in the denominator of the calculated item is required for me because my item is run every five minutes.

Alternative
No one really cares about the actual total value of byte count, right? So just re-purpose the In Bytes Count item a bit as follows:

  • add preprocessing step: Change per second
  • add second preprocessing step, Custom multiplier 8e-6

The first step gives you units of bytes/second which is less interesting than mbps, which is given by the second step. So the final units are mbps.

Be sure to put the units as !mbps into the Zabbix item, or else you may wind up with funny things like Kmbps in your graphs!
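The calculated item and the two preprocessing steps are the same arithmetic done in a different order. Here is a small Python sketch to check the numbers. The function names and the sample byte counts are made up for illustration.

```python
def mbps_calculated_item(previous_bytes: int, last_bytes: int, interval_s: int = 300) -> float:
    """change(key)*8/1000000/300: byte delta to bits, to megabits, to per-second."""
    return (last_bytes - previous_bytes) * 8 / 1_000_000 / interval_s


def mbps_preprocessing(previous_bytes: int, last_bytes: int, elapsed_s: float) -> float:
    """Change per second (bytes/s), then custom multiplier 8e-6 (megabits/s)."""
    bytes_per_second = (last_bytes - previous_bytes) / elapsed_s
    return bytes_per_second * 8e-6


if __name__ == "__main__":
    # 375,000,000 bytes transferred in one five-minute interval
    print(mbps_calculated_item(1_000_000_000, 1_375_000_000))
    print(mbps_preprocessing(1_000_000_000, 1_375_000_000, 300))
```

Both come out at about 10 mbps for this example. The preprocessing version has the advantage of using the actual elapsed time rather than a hard-coded 300.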

Creating a baseline

Even as of Zabbix v 5, there is no built-in baseline item type, which kind of sucks. Baseline can mean many different things to many people – it really depends on the data. In the corporate world, where I’m looking at bandwidth, my data has these distinct characteristics:

  • varies by hour-of-day, e.g., mornings see heavier usage than afternoons
  • there is the “Friday effect” where somewhat less usage is seen on Fridays, and extremely less usage occurs on weekends, hence variability by day-of-week
  • probably varies by day of month, e.g., month-end closings

So for this type of data (except the last criterion) I have created an appropriate baseline. Note I would do something different if I were graphing something like the solar generation from my solar panels, where the day-of-week variability does not exist.

Getting to the point, I have created a rolling lookback item. This needs to be created as a Zabbix Item of type Calculated. The formula is as follows:

(last(sgProxyInBytesCount,#1,1w)+
last(sgProxyInBytesCount,#1,2w)+
last(sgProxyInBytesCount,#1,3w)+
last(sgProxyInBytesCount,#1,4w)+
last(sgProxyInBytesCount,#1,5w)+
last(sgProxyInBytesCount,#1,6w))/6

In this example sgProxyInBytesCount is my key from the reference item. Breaking it down, it does a rolling lookback of the last six measurements taken at this time of day on this day of the week over the last six weeks and averages them. Voila, baseline! The more weeks you include, the more likely you are to include data you’d rather not, like holidays, days when things were busted, etc. I’d like to have a baseline that is from a fixed time, like “all of last year.” I have no idea how. I actually don’t think it’s possible.

But, anyway, the baseline approach above should generally work for any numeric item.
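To make the lookback concrete, here is a rough Python sketch of what that formula computes, outside of Zabbix. The function names and the data layout (two parallel lists, sorted by time) are my own assumptions for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def last_value_at(timestamps, values, when):
    """Most recent value at or before 'when', like last(key,#1,<time shift>)."""
    i = bisect_right(timestamps, when)
    if i == 0:
        raise ValueError("no sample at or before %s" % when)
    return values[i - 1]


def rolling_baseline(timestamps, values, now, weeks=6):
    """Average of the values seen at this time of day, 1 to 'weeks' weeks ago."""
    picks = [
        last_value_at(timestamps, values, now - timedelta(weeks=w))
        for w in range(1, weeks + 1)
    ]
    return sum(picks) / len(picks)
```

Feed it the item's history and the current time, and it returns the value the calculated item would show at that moment.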

Refinement

The above approach only gives you six measurements, so the scatter of the average is only reduced to 1/sqrt(6) ~ 40% of that of a single measurement (the standard error of a mean falls as 1/sqrt(N)), which is still pretty jittery as it turns out. So I came up with this refined approach which includes 72 measurements (twelve five-minute samples in each of six one-hour windows), hence 1/sqrt(72) ~ 12%. I find that to be closer to what you intuitively expect in a baseline – a smooth approximation of the past. Here is the refined function:

(avg(sgProxyInBytesCount,1h,1w)+
avg(sgProxyInBytesCount,1h,2w)+
avg(sgProxyInBytesCount,1h,3w)+
avg(sgProxyInBytesCount,1h,4w)+
avg(sgProxyInBytesCount,1h,5w)+
avg(sgProxyInBytesCount,1h,6w))/6

I would have preferred a one-hour interval centered around one week ago, etc., e.g., something like 1w-30m, but such date arithmetic does not seem possible in Zabbix functions. And, yeah, I could put 603000s (i.e., 604800 – 1800), but that is much less meaningful and so harder to maintain. Here is a three-hour graph whose first half still reflects the original (jittery) baseline, and last half the refined function.

Latter part has smoothed baseline in light green
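Here is a Python sketch of the refined calculation, along with the 1/sqrt(N) figures quoted above. As before, the function names and the list-of-tuples data layout are assumptions of mine, not anything Zabbix provides.

```python
from datetime import datetime, timedelta
from math import sqrt


def window_average(samples, end, width=timedelta(hours=1)):
    """Average of the values in the window (end - width, end], like avg(key,1h,<shift>)."""
    picked = [value for stamp, value in samples if end - width < stamp <= end]
    return sum(picked) / len(picked)  # raises ZeroDivisionError on an empty window


def refined_baseline(samples, now, weeks=6):
    """Mean of the one-hour averages ending 1 to 'weeks' weeks before now."""
    averages = [
        window_average(samples, now - timedelta(weeks=w))
        for w in range(1, weeks + 1)
    ]
    return sum(averages) / weeks


def relative_scatter(n):
    """Scatter of a mean of n independent samples, relative to a single sample."""
    return 1 / sqrt(n)


if __name__ == "__main__":
    print(round(relative_scatter(6), 3), round(relative_scatter(72), 3))
```

The two printed numbers are roughly 0.41 and 0.12, i.e., the 40% and 12% mentioned above.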

What I have not mastered is whether we can easily use a proper smoothing function. It does not seem to be a built-in offering of Zabbix. Perhaps it could be faked by a combination of pre-processing and Javascript? I simply don’t know, and it’s more than I wish to tackle for the moment.

Data gap between multiple item measurements looks terrible in Dashboard graph – solution

In a Dashboard if you are graphing items which were not all measured at the same time, the results can be frustrating. For instance, an item and its baseline as calculated above. The central part of the graph will look fine, but at either end giant sections will be missing when the timescale of display is 30 minutes or 60 minutes for items measured every five minutes or so. Here’s an example before I got it totally fixed.

Zabbix item timing mismatch

See the left side – how it’s broken up? I had begun my fix so the right side is OK.

The data gap solution

Use Scheduling Intervals in defining the items. Say you want a measurement every five minutes. Then make your scheduling interval m/5 in all the items you are putting on the same graph. For good measure, make the regular interval value infrequent. I use a macro {$UPDATE_LONG}. What this does is force Zabbix to measure all the items at the same time, in this case every five minutes on minutes divisible by five. Once I did that my incoming bandwidth item and its corresponding baseline item aligned nicely.
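The effect of m/5 can be pictured with a few lines of Python: whatever time an item happens to be created or last checked, the next run lands on a minute divisible by five, so all the items line up. This is just my illustration of the idea, not how Zabbix implements it, and the function name is made up.

```python
from datetime import datetime, timedelta


def next_aligned_run(now: datetime, step_minutes: int = 5) -> datetime:
    """Next wall-clock time whose minute is divisible by step_minutes."""
    base = now.replace(second=0, microsecond=0)
    minutes_to_add = step_minutes - base.minute % step_minutes
    return base + timedelta(minutes=minutes_to_add)


if __name__ == "__main__":
    # Two items checked at different moments still share the same next run time
    print(next_aligned_run(datetime(2020, 5, 4, 12, 3, 27)))
    print(next_aligned_run(datetime(2020, 5, 4, 12, 1, 2)))
```

Both calls give 12:05:00, which is the whole point.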

Low-level Discovery

I cottoned on to the utility of this part of Zabbix a little late. Hey, slow learner, but I eventually got there. What I found in my F5 devices is that using SNMP to monitor the /var filesystem was a snap: it was always device 32 (final OID digit). But /var/log monitoring? Not so much. Every device seemed different, with no obvious pattern. Active and standby units – identical hardware – and some would be 53, the partner 55. Then I rebooted a device and its number changed! So, clearly, dynamically assigned and no way was I going to keep up with it. I had learned the numbers by doing an snmpwalk. The solution to this dynamically changing OID number is to use low-level discovery.
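Conceptually, discovery just does for you what I had been doing by hand: walk the table, and map each filesystem name to whatever index it has right now. A Python sketch of that idea follows. I am assuming the walk is of hrStorageDescr from HOST-RESOURCES-MIB and that the output lines look like the made-up sample below, so treat the details as hypothetical.

```python
import re

# Matches lines such as: HOST-RESOURCES-MIB::hrStorageDescr.53 = STRING: /var/log
LINE = re.compile(r"hrStorageDescr\.(\d+) = STRING: (.+)$")


def discover_indexes(walk_output: str) -> dict:
    """Map each filesystem name to its current (dynamically assigned) index."""
    found = {}
    for line in walk_output.splitlines():
        match = LINE.search(line.strip())
        if match:
            found[match.group(2)] = int(match.group(1))
    return found


if __name__ == "__main__":
    sample = (
        "HOST-RESOURCES-MIB::hrStorageDescr.32 = STRING: /var\n"
        "HOST-RESOURCES-MIB::hrStorageDescr.53 = STRING: /var/log\n"
    )
    print(discover_indexes(sample)["/var/log"])
```

After a reboot the number for /var/log may change, but a lookup by name keeps working. That is the service low-level discovery provides.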

Tip: using zabbix_sender in a more robust fashion

We run the Zabbix proxies as pairs. They are not run as a cluster. Instead one is active and the other is a warm standby. Then we can upgrade at our leisure the standby proxy, switch the hosts to it, then upgrade the other now-unused proxy.

But our scripts which send results using zabbix_sender run on other servers. Their data stops being recorded when the switch is made. What to do?

I learned you can send to both Zabbix proxies. It will fail on the standby one and succeed on the other. Since one proxy is always active, it will always succeed in sending its data!
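In a script this can be as simple as looping over both proxies and treating the send as successful if either one accepts it. Here is a minimal Python sketch. The proxy names are placeholders, and the run parameter is only there so the logic can be tried without a real zabbix_sender installed.

```python
import subprocess


def build_command(proxy, host, key, value):
    """zabbix_sender command line: -z proxy, -s monitored host, -k item key, -o value."""
    return ["zabbix_sender", "-z", proxy, "-s", host, "-k", key, "-o", str(value)]


def send_to_all(proxies, host, key, value, run=subprocess.run):
    """Send the value to every proxy; return True if at least one send succeeded."""
    succeeded = False
    for proxy in proxies:
        try:
            result = run(build_command(proxy, host, key, value),
                         capture_output=True, timeout=10)
            succeeded = succeeded or result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass  # the standby proxy is expected to fail; try the next one
    return succeeded
```

A call would look like send_to_all(["proxy-a.example.com", "proxy-b.example.com"], "myhost", "my.key", 42), with your own proxy names, host and key substituted.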

A nice DNS synthetic monitor

It would have been so easy for Zabbix to have built in the capability of doing synthetic DNS checks against your DNS servers. But, alas, they left it out. Which leaves it to us to fill that gap. Here is a nice and simple but surprisingly effective script for doing synthetic DNS checks. You put it in the external script directory of whatever proxy is monitoring your DNS host. I called it dns.sh.

#!/bin/bash
# bash rather than sh: the script relies on [[ ]] and =~
# arg1 - hostname of nameserver
# arg2 - DNS server to test
# arg3 - FQDN
# arg4 - RR type
# arg5 - match arg
# [arg6] - tcpflag # this argument is optional

# if you set DEBUG=1, and debug through zabbix, set item type to text
DEBUG=0
timeout=2 # secs - seems a good value
name=$1
nameserver=$2
record=$3
type=$4
match=$5
tcpflag=$6
[[ "$DEBUG" -eq "1" ]] && echo "name: $name, nameserver: $nameserver , record: $record , type: $type , match pattern: $match, tcpflag: $tcpflag"
[[ "$tcpflag" = "y" ]] || [[ "$tcpflag" = "Y" ]] && PROTO="+tcp"
# unless you set tries to 1 it will try three times by default!
MATCH=$(dig +short $PROTO +timeout=$timeout +tries=1 $type $record @${nameserver} )
[[ "$DEBUG" -eq "1" ]] && echo MATCHed line is $MATCH
return=0
[[ "$MATCH" =~ $match ]] && return=1
[[ "$DEBUG" -eq "1" ]] && echo return $return
echo $return

It gives a value of 1 if it matched the match expression, 0 otherwise.

Conclusion
A couple of really useful but poorly documented items are shared. Perhaps more will be added in the future.


References and related

https://support.zabbix.com/browse/ZBXNEXT-3899 for SNMP DateAndTime format

My first Zabbix post was mostly documentation of a series of disasters and unfinished business.

Blog post about calculated items by a true expert: https://blog.zabbix.com/zabbix-monitoring-with-calculated-items-explained/9950/

Low-level Discovery write-up: https://blog.zabbix.com/how-to-use-zabbix-low-level-discovery/9993/

Categories
Admin Network Technologies

Monitoring by Zabbix: a working document

Intro
I panned Zabbix in this post: DIY monitoring. But I have compelling reasons to revisit it. I have to say it has matured, but there remain some very frustrating things about it, especially when compared with SiteScope (now owned by Microfocus) which is so much more intuitive.

But I am impressed by the breadth of the user base and the documentation. But learning how to do any specific thing is still an exercise in futility.

I am going to try to structure this post as a list of problems encountered, and how they were resolved.

Current production version as of this writing?
Answer: 6.0

Zabbix Manual does not work in Firefox
That’s right. I can’t even read the manual in my version of Firefox. Its sections do not expand. Solution: use Chrome

Which database?
You may see references to MYSQL in Zabbix docs. MYSQL is basically dead. What should you do?

Zabbix quick install on Redhat

Answer
Install mariadb, which has replaced MYSQL and supports the same commands, such as the mysql command from the screenshot. On my Redhat instance I have installed these mariadb-related packages:

mariadb-5.5.64-1.el7.x86_64
mariadb-server-5.5.64-1.el7.x86_64
mariadb-libs-5.5.64-1.el7.x86_64

Terminology confusion
What is a host, a host group, a template, an item, a web scenario, a trigger, a media type?
Answer

Don’t ask me. When I make progress I’ll post it here.

Web scenario specific issues
Can different web scenarios use different proxies?
Answer: Yes, no problem. In really old versions this was not possible. See web scenario screenshot below.

Can the proxy be a variable so that the same web scenario can be used for different proxies?
Answer: Yes. Let’s say you attach a web scenario to a host. In that host’s configuration you can define a “macro” which sets the variable value. e.g., the value of HTTP_PROXY in my example. I think you can do the same from a template, but I’m getting ahead of myself.

Similarly, can you do basic proxy auth and hide the credentials in a MACRO? Answer: I think so. I did it once at any rate. See above screenshot.

Why does my google.com web scenario work whereas my amazon.com scenario not when they’re exactly the same except for the URL?
Answer: some ideas, but the logging information is bad. Amazon does not take kindly to bots hitting it for health checks. It may work better to change the agent type to Linux|Chrome, which is what I am trying now. Here’s my original answer: Even with command-line curl I get an error through this proxy. That can’t be good:
$ curl -vikL www.amazon.com

...
NSS error -5961 (PR_CONNECT_RESET_ERROR)
* TCP connection reset by peer
* Closing connection 1
curl: (35) TCP connection reset by peer


My amazon.com web scenario is not working (status of 1), yet in dashboard does not return any obvious warning or error or red color. Why? Answer:
no idea. Maybe you have to define a trigger?

Say you’re on the Monitoring|latest data screen. Does the data get auto-updated? Answer: yes, it seems to refresh every 30 seconds.

In Zabbix Latest Data can you control the history displayed via url parameters? By default only one hour of history is displayed. Answer: There is an undocumented feature I have discovered which permits this. Let’s say your normal URL for your direct link to the latest data of item 1234 is https://drjohns.com/history.php?action=showgraph&itemids[]=1234. The modified version of that to display the last day of data is: https://drjohns.com/history.php?action=showgraph&from=now-1d&to=now&itemids[]=1234
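If you generate such links from a script, a few lines of Python will do it. This is only a sketch around the from and to parameters described above. The function name is made up, and I have only considered periods written like 1d or 12h.

```python
from urllib.parse import urlencode


def history_url(base: str, item_id: int, period: str = "1d") -> str:
    """Build a Latest Data graph link showing the last 'period' of history."""
    query = [
        ("action", "showgraph"),
        ("from", f"now-{period}"),
        ("to", "now"),
        ("itemids[]", str(item_id)),
    ]
    # safe="[]" keeps the square brackets readable instead of percent-encoding them
    return f"{base}/history.php?" + urlencode(query, safe="[]")


if __name__ == "__main__":
    print(history_url("https://drjohns.com", 1234))
```

That prints the same one-day URL as in the example above.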

In latest data viewing the graph for one item which has a trigger, sometimes the trigger line is displayed as a dashed line and sometimes not at all. Answer: From what I can tell the threshold line is only displayed if the threshold was entered as a number in the trigger condition. Strange. Unfortunate if true.

Why is the official FAQ so useless? Answer: no idea how a piece of software otherwise so feature-rich could have such a useless FAQ.

Zabbix costs nothing. Is it still actively supported? Answer: it seems very actively supported for some reason. Not sure what the revenue model is, however…

Can I force one or more web scenarios to be run immediately? I do this all the time in SiteScope. Answer: I guess not. There is no obvious way.

Suppose you have defined an item. what is the item key? Answer: You define it. Best to make it unique and use contiguous characters. I’m seeing it’s very important…

What is the equivalent to SiteScope’s script monitors? Answer: Either ssh check or external check.

How would you set up a simple PING monitor, i.e., to see if your host is up? Answer: Create an item as a “simple check”, e.g., with the name ping this host, and the key icmpping[{HOST.IP},3]. That can go into a template, by the way. If it succeeded it will return a 1.

I’ve made an error in my script for an external check. Why does Latest data show nothing at all? Answer: no idea. If the error is bad enough Zabbix will disable the item on you, so it’s not really running any longer. But even when it doesn’t do that, a lot of times I simply see no output whatsoever. Very frustrating.

Help! The Latest Data graph’s Y axis only shows 0’s and 5’s. Answer: Another wonderful Zabbix feature, this happens because your Units are too long. Even “per minute” as Units can get you into trouble if it is trying to draw a Y axis with values 22.0 22.5 23.0, etc: you’ll only see the .0’s and the .5’s. Change units to a maximum of seven characters such as “per min.”

Why is the output from an ssh check truncated, where does the rest go? Answer: no idea.

How do you increase the information contained in the zabbix server log? Let’s say your zabbix server is running normally. Then run this command: zabbix_server -R log_level_increase
You can run it multiple times to keep increasing the verbosity (log level), I think.

Attempting to use ssh items with key authentication fails with: “Public key authentication failed: Callback returned error”. Initially I thought Zabbix was broken with regards to ssh public key authentication. I can get it to work with password. I can use my public/private key to authenticate by hand from command-line as root. Turns out running a command such as sudo -u zabbix ssh … showed that my zabbix account did not have permissions to write to its home directory (which did not even exist). I guess this is a case of RTFM, because they do go over all those steps in the manual. I fixed up permissions and now it works for me, yeah.

Where should the scripts for external checks go? In my install it is /usr/lib/zabbix/externalscripts.

Why is the behaviour of triggers inconsistent? Sometimes the same trigger has expected behaviour, sometimes not. Answer: No idea. Very frustrating. See more on that topic below.

How do you force a web scenario check when you are using templates? Answer: No idea.

Why do (resolved) Problems disappear no matter how you search for them if they are older than, say, 30 minutes? Answer: No idea. Just another stupid feature I guess.

Why does it say No media defined for user even though user has been set up with email as his media? Answer: no idea.

Why do too many errors disable an ssh check so that you get Status Disabled and have no graceful way to recover? Answer: no idea. It makes sense that Zabbix should not subject itself to too many consecutive errors. But once you’ve fixed the underlying problem the only recovery I can figure is to delete the item and recreate it, or delete the host and re-create it. Not cool.

I heard dependent items are the way to go to parse complex data coming out of a rich text item. How do you do that? Answer: Yes they are. I have gotten them to work and really give me the fine-grained control I’ve always wanted. I hope to show a real-life example soon. To get started creating a dependent item you can right-click on the dots of an item, or create a new item and choose type Dependent Item.

I am looking at Latest data and one item is grayed out and has no data. Why? Answer: almost no idea. This happens to me in a dependent item formed by a regular expression where the regular expression does not match the content. I am trying to make my RegEx more flexible to match both good and error conditions.

Why do my dependent items, when running a Check Now, say Cannot send request: wrong data type, yet they are producing data just fine when viewed through Latest data? Answer: this happens if you ran a Check Now on your template rather than when viewing an individual host. Make sure you select a host before you run Check Now. Actually, even still it does not work, so final answer: no idea.

Why do some regular expressions check out just fine on regex101.com yet produce a match for value of type “string”: pattern does not match error in Zabbix? Answer: Some idea. Fancy regular expressions do not seem to work for some reason.

Every time I add an item it takes the absolute maximum amount of time before I see data, whether or not I run check Now until I turn blue in the face. Why? Answer: no idea. Very frustrating.

If the Zabbix server is in one time zone and I am in another, can I have my view of timestamps customized to my time zone? Otherwise I see all times in the timezone of the Zabbix server. Answer: You are out of luck. The suggestion is to run two GUIs, one in your time zone. But there is a but. Support for this has been announced for v 5.2. Stay tuned…

My DNS queries using net.dns don’t do anything. Why? Answer: no idea. Maybe your host is not running an actual Zabbix agent? That’ll do it. Forget that net.dns check if you can’t install an agent. Zabbix has no agentless DNS monitor for some strange reason.

A DNS query which returns many address records fails (such as querying an AD domain), though occasionally succeeds. Why? Answer: So your key looks something like this, right? net.dns.record[10.1.2.3,my-AD-domain.net,A,10,2,tcp]. And when you do the query through dig it works fine, right? E.g., dig +tcp my-AD-domain.net @10.1.2.3. And you’ve set the Zabbix response to type text? It seems to be just another Zabbix bug. You may have to use a script instead. Zabbix support has been able to reproduce this bug and they are working on it as we speak.

What does Check/Execute Now really do? Answer: essentially nothing as far as I can tell. It certainly doesn’t “check now”, i.e., force the item to be run. However, if you have enough permissions, what you can do when you’re looking at an item for a specific Host is to run Test. Then Get Value. I usually get Permission Denied, however.

I want to show multiple things on a dashboard widget graph like an item plus its baseline (Ed: see references for calculating a baseline). What’s the best way? Answer: You can use the add new data set feature for instance to add your baseline. In your additional data set you put your baselines. Then I like to make the width 2, transparency 0 and fill 0. This will turn it into a thin bold line with a complementary color while not messing too much with the original colors of your items. The interface is squirrely, but, hey, it’s Zabbix, what did you expect?

I have a lot of hosts I want to add to a template. Does that Mass Update feature actually work? Answer: yes. Use it. It will save you time.

Help! I accidentally deleted an entire template. I meant to just delete one of its macros. Is there a revert? Answer: it doesn’t look like it. Hope you remember what you did…

It seems if I choose units in an item which have too many characters, e.g., client connections, the graph (in Latest Data) cuts it off and doesn’t even display the scale? Answer: seems so. It’s a bug. This won’t happen when using vector graphs in Dashboard. The graphs in Latest Data are PNG and limited to short Units, e.g., mbps. Changing to vector graphs has been in the roadmap but then disappeared.

Can I create a baseline? Nope. It’s on the roadmap. However, see this clever idea for building one on your own without too much effort.

I’ve put a few things on the same Dashboard graph. Why don’t they align? There are these big gaps. Zabbix runs the items when it feels like, and the result is gaps in data which Zabbix makes no attempt to conceal at the beginning and end of a graph. You can use Scheduling Intervals on your items to gain some control over this. See this article for details.

Besides cloning the whole thing, how can I change the name of a Dashboard? Answer: If you just click to edit a Dashboard the name appears fixed. However, click on the gear icon and that gives you the option to edit the dashboard name. It’s kind of an undocumented feature.

My SNMP MIB has bytes in/out for an interface when what I really want is bandwidth, i.e., Megabits per second. A little preprocessing on a 64-bit bytes value and you are there (32 bit values may roll over too frequently). See this article for details.

In functions like avg (sec|#num,<time_shift>), why is the time_shift argument so restricted? It can’t be a macro, contain a formula like 1w-30m, or anything semi-sophisticated. It just accepts a dumb literal like 5h? Answer: It’s just another shortcoming in Zabbix. How much did you pay for it? 🙂

I have an SNMP template with items for a hostgroup of dispersed servers. Some work fine. The one in Asia returns a few values, but not all. I am using Bulk Request. Answer (to your implied question!) You must have bad performance to that one. Use a Zabbix proxy with a longer timeout for SNMP requests. I was in that situation and that worked for me.

SNMPv3 situation. I have two identical virtual servers monitored by the same Zabbix proxy. Only one works. Command-line testing of snmpwalk looks fine. What could it be? Answer: We are fighting this now. In our case the SNMP v3 engineIDs are identical on the two virtual servers because they were from the same image, whereas, if you read the specs, they are supposed to be unique, like a MAC address. Who knew? And, yes, once we made the engineIDs unique, they were fine in Zabbix.

Riddle: when is 80% not 80%? Answer: when pulling in used storage on a filesystem via SNMP and comparing it to storage size! I had carefully gotten a filesystem 83% full based on the output of df -m. But my trigger, set to go off at 80%, never went off. How could it be? The 83% includes some kind of reserved user space on the filesystem which is not included when you do the calculation directly. So I was at 78% or so in actuality. I changed the trigger to 75%.
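
A small sketch of how the two percentages can diverge. The numbers are hypothetical, and the 5% reserve is an assumption (a common default on ext-type filesystems), not something I measured on my system. As far as I know, df computes its percentage as used / (used + available) and rounds up, while the SNMP-based calculation divides used by the full size.

```python
import math


def df_use_percent(used: int, available: int) -> int:
    """Percentage the way df reports it: used / (used + available), rounded up."""
    return math.ceil(100 * used / (used + available))


def raw_use_percent(used: int, size: int) -> float:
    """Percentage from dividing used by total size, reserved space included."""
    return 100 * used / size


size = 1000       # MB, hypothetical filesystem
reserved = 50     # MB, assumed 5% reserve not available to ordinary users
used = 780        # MB
available = size - reserved - used   # 170 MB

print(df_use_percent(used, available))  # 83
print(raw_use_percent(used, size))      # 78.0
```

With these numbers df says 83% while the direct calculation says 78%, so a trigger at 80% stays quiet.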

My trigger for a DNS item, which relies on a simple diff(), goes off from time-to-time yet the response is the same. Why? Answer: We have seen this behavior for a CNAME DNS item. The response changed the case of the returned FQDN from time-to-time, and that is enough to set off the Zabbix diff()-based trigger! We pre-processed the output with a RegEx to just get the bits we wanted to examine to fix this.

Related question. My diff() trigger for a DNS item does NOT go off when the server actually goes down. What’s up with that? Answer: Although you might expect a suddenly unavailable server constitutes a “difference,” in Zabbix’s contorted view of reality it does not. I recommend an additional trigger using the function nodata().

Does the new feature of login using SAML actually work? Answer: Yes, we are using it in Zabbix v 5.0.

My OIDs for my filesystems keep shifting around. What to do? Answer: Use low-level discovery. It’s yet another layer of abstraction and confusion, but it’s probably worth it. I intend to write up my approach in my practical Zabbix examples blog post.

After a Zabbix agent item goes bad (no data), Zabbix refuses to test it for a full 30 minutes after it went bad, despite an update interval of 5 minutes. Why? Answer: In one of the worst architectural decisions of all time, Zabbix created the concept of unsupported items. It works something like this: the very moment when you need to be told Hey there’s something wrong here is when Zabbix goes quiet. Your item became unsupported, which is like being in the penalty box for 30 minutes, during which time nothing works like you naively expected it to. Even the fact that your item became unsupported is almost impossible to find out from a trigger. An example of software which treats this situation correctly is Microfocus SiteScope. In Zabbix version 5.0 there’s a global timeout for all unsupported items. Ours is set to 30 minutes, you see. In some cases that may make sense and prevent Zabbix from consuming too many resources trying to measure things which don’t work. I find it annoying. For DNS, specifically, it is best to use a key of type net.dns and not net.dns.record. That returns a simple 0 or 1 and does not become unsupported if the DNS server can’t be reached. V 5.2 will provide some more options around this issue. For an HTTP agent and I suppose many other items, it’s best to create triggers which use the nodata() function, which can somewhat compensate for this glaring weakness in Zabbix. If you run Zabbix v 5.2, you should use the new preprocessing rule “Check for not supported value” and then set a new value, e.g., “Error”. Then the item won’t become unsupported and can also be used for triggers.

We’ve got SNMP items set up for a host. What’s the best way to alert for a total outage? Answer: I just learned this. This is closely related to the previous question. To avoid that whole unsupported item thing, you make a Zabbix internal item. The key is literally this: zabbix[host,snmp,available] and the type is numeric unsigned. This will continue to poll even if the other host items became unsupported. This is another poorly documented Zabbix feature.

While trying to set up a host for SNMP monitoring I get the error Cannot update Host. Cannot find host interface on host_name for item key item_name. Answer: You probably used an interface type of Agent instead of SNMP. Under Interfaces for the host, add one for type SNMP and remove the Agent one. Or, maybe the reverse: your item type is of type Zabbix agent but your host’s interface is of type SNMP – that combo also produces this error.

In Zabbix my SNMP item shows error No such instance currently exists at this OID, yet my snmpwalk for same shows it works. Why? Answer: In my case I switched to snmpget for my independent testing and reproduced that error, and found that I needed a literal ."0" at the end of the OID (specifically for swap used on an F5 device). Once I included the ."0" (with the double-quotes) in the OID in Zabbix it began to work. In another case I could do the snmpget from the same Zabbix proxy where I was getting this error message. The custom MIB was right there in /usr/share/snmp/mibs on the Zabbix proxy. Zabbix hadn’t been started in a while. I restarted it and the problem went away.

I wish to use a DNS value instead of an IP in net.tcp.service[service,IP,port] because I use geoDNS or round-robin DNS. Can I? Answer: It seems to work, yes.

Can I send alerts to MS Teams? Answer: This is obviously a fake question. But the answer is Yes. You set up a Connector in a MS Teams channel. It’s pretty straightforward and it’s pretty cool. I’ll try to publish more in my Zabbix tips post if I have time.

Get a lot of false positives? Answer: Yes! On F5 equipment this one is vexing me:

Resolved: BIG-IP is unreachable via SNMP for 15 minutes

And for others (pool member unavailable for a few minutes) I tried to require two consecutive failures before sending an alert. Basically still working on it.

I have a bunch of HTTP items on this one Zabbix proxy. They all sort of go bad at the same time (false positives) and Zabbix says this agent is unreachable for five minutes around the same time. Answer: Seen that. Short term it may be advisable to create a dependent trigger: https://www.zabbix.com/documentation/5.0/manual/config/triggers/dependencies  Mid-term I am going to ask support about this problem.

Why is the name field truncated in Monitoring | Latest Data, with no possibility to increase it? Answer: If you have Show Details selected you see very few characters. Deselect that.

What, Zabbix version 5.2 RPMs are not available for RHEL 7? Answer: that is correct, unfortunately, as of this writing. You can run as high as v 5.0.7. We are trying to pressure them to provide this compatibility. Lots of people still run Redhat v 7.

Can you send reminder alerts periodically for a problem which persists? Answer: Yes you can. For instance, every four hours. Read all about it in the manual, under Action | Escalations, and look at their examples. However, the documentation is at odds with the product’s behaviour if you have multiple alerts with different durations defined. I am studying it…

Is Zabbix affected by the same hack that infected SolarWinds? Answer: No idea. Let’s see. Developed in Eastern Europe. Basically, no one’s saying. Let’s hope not.

Is Zabbix stupid enough to send multiple alerts for the same problem? Answer: In a word, yes. If you are unlucky enough to have defined overlapping alert conditions in your various alerts, Zabbix will make no effort to consolidate them.

What does it mean when I look at a host and I see inaccessible template? Answer: Most likely explanation is that you don’t have permission to see that template.

Can the y-axis be drawn in a logarithmic scale in a dashboard graph? I have low values (time for a DNS query) which sometimes soar to high ones. Answer: No. This feature has been requested now for almost 10 years and still is lacking. I will try to make a feature request.

Why does our Zabbix agent time out so often? The message is Zabbix agent on hostname is unreachable for five minutes. The problem is sporadic but it really interferes with the items like our simple net.dns checks. Answer: If you use a lot of net.dns agent items you can actually cause this behavior if you are running agent2. The default agent item is passive. We had better luck using an Active Agent item. We had severe but random timeouts and they all went away.

Our Webhook to MS Teams was working fine. Then we set up a new one to a new channel which wouldn’t work at all. A brief error message says invalid Webhook or something. What’s the fix? Answer: It is a known bug which is fixed in v 5.0.8. Of course a lot else could be wrong. In fairness Microsoft changes the format for webhooks from time-to-time so that could be the problem. This Microsoft page is a great resource to do your own testing of the Webhook: Sending messages to Connectors and Webhooks – Teams | Microsoft Docs

The formatting of alert emails is screwy, especially with line breaks in the wrong places. Can I force it to send HTML email to gain more control? Answer: Sort of. You can define a media type where you use HTML email instead of plain text email. I personally don’t have access to do that. But it is not possible to selectively use HTML email within the Custom email form of the alert setup screen. With the more straightforward custom emails, the trick is to put in extra line breaks. A single solitary linebreak is sometimes ignored, especially if the sequence is MACRO-FUNCTION linebreak more text. But if you use two consecutive linebreaks it will inject two linebreaks.

I swear Zabbix is ignoring my macros in trigger functions used in templates which refer to time values in minutes, and just filling in 0 instead. Is that even possible? Answer: I’m still investigating this one. I will withhold my customary sardonic comments about Zabbix until I know who or what is to blame. [Later] I’m thinking this one is on me, not Zabbix.

Do Zabbix items, particularly HTTP items, have the concept of a hidden field to hide confidential data such as passwords from others with the same level of access? Answer: Apparently not. But if you believe in the terrible idea of security by obscurity, you can obscure values by stuffing them into a macro.

My Zabbix admin won’t let me get creative. No external items, no ssh items, etc. I can run some interesting scripts on my linux server. How to stuff the results into Zabbix? Answer: Install zabbix_sender utility on your linux. Then set up an item of type Zabbix trapper. The link to the RPM for zabbix_sender is in the references.
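
A minimal sketch of wrapping zabbix_sender from a script. The -z (server or proxy), -s (host name as configured in Zabbix), -k (trapper item key) and -o (value) options are the standard ones. The server, host and key names here are hypothetical placeholders.

```python
import subprocess


def sender_command(server: str, host: str, key: str, value: str) -> list[str]:
    """Build the zabbix_sender argument list for a single value."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", value]


def send_value(server: str, host: str, key: str, value: str) -> bool:
    """Run zabbix_sender; True if it exited with status 0."""
    result = subprocess.run(sender_command(server, host, key, value),
                            capture_output=True, text=True)
    return result.returncode == 0


# Hypothetical names: zabbix-proxy.example.com, myserver, custom.script.result
print(" ".join(sender_command("zabbix-proxy.example.com", "myserver",
                              "custom.script.result", "36")))
```

The host name must match the host as it is named in Zabbix, and the key must match the key of the Zabbix trapper item, otherwise the value is rejected.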

These days nothing is either black or white. So when a trigger fires, it’s likely it will return to good status, and then bad, and then good, etc. The alerts are killing us and casual users tend to discount all of them. What to do? Answer: This is common-sense, but, a very good strategy in these cases is to define a recovery expression for that trigger that looks at the average value for the last 3600 seconds and requires it to be in the good range before the trigger that all is good gets sent out as an alert.
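
To see why this helps, here is a toy simulation, not Zabbix itself. It counts how many state changes (and hence alerts) a flapping series produces, with and without an averaging condition on recovery. The sample values and the window of 4 samples are made up; in Zabbix the window would be a time period such as the 3600 seconds mentioned above.

```python
from collections import deque


def count_alerts(values, threshold, window=None):
    """Count state changes (each one would send an alert).

    A problem fires when the latest value exceeds the threshold.
    Without a window, recovery happens as soon as a value is back under.
    With a window, recovery also requires the average of the last
    `window` samples to be at or under the threshold.
    """
    recent = deque(maxlen=window or 1)
    in_problem = False
    changes = 0
    for v in values:
        recent.append(v)
        if not in_problem and v > threshold:
            in_problem = True
            changes += 1
        elif in_problem and v <= threshold:
            if sum(recent) / len(recent) <= threshold:
                in_problem = False
                changes += 1
    return changes


# Made-up samples flapping around a threshold of 90, then settling down
samples = [99, 85, 99, 85, 99, 85, 50, 50, 50, 50]
print(count_alerts(samples, 90))            # 6
print(count_alerts(samples, 90, window=4))  # 2
```

Six alerts collapse into two: one when the problem starts and one when things have genuinely calmed down.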

I’m using the dynamic host feature in a dashboard. Unfortunately, one of my hosts has a really short name that matches so many other hosts that it never appears in the drop-down list. What to do? Answer: Click the “select” button to the right of the search field. Then you can choose the host group and from there the host. Or rename the host to something more unique.

I wish to add some explanatory text in the dashboard I’ve created. Is it possible? Answer: This is laughably kludgy, but you can do this with a map widget. What you can do is to create a map, add a text box to it, and put your desired text into the text box. But it is hard to get the sizing correct as things shrink when putting the widget on the dashboard.

My top hosts widget is now displaying 0’s. Answer: This happened after we upgraded from v 6.0 to 6.0.8. In characteristically Zabbix illogical fashion, if you now sort by BottomN instead of TopN you should see the expected results (highest on top). Not all our widgets displayed this bug!

I have an item which only runs once a week. Monitoring > Latest Data doesn’t show any values. Is that a bug or feature? Answer: There is a setting somewhere where you can change this behavior. Set it to last two weeks and all will be well.

While using the pyzabbix Zabbix api I had trouble switching from username/password to use an authentication token. Answer: Perhaps you installed both py-zabbix as well as pyzabbix? I’m confused by this, as there is some overlap. To use the token auth method – preferred by experts – uninstall both these packages and re-install only pyzabbix. I will give an example in my other Zabbix blog post, Practical Zabbix examples.

The trigger.create api call says a dependent triggerid must be passed? Is that really mandatory? It makes no sense. Answer: No. I experimented with it and found you can just leave the dependencies out altogether. The documentation is wrong.

I need to create about 100 custom alerts. Is there seriously no way to do this via the api? Answer: apparently not.

What’s the correct way to send a compound filter expression via the api? Answer: Watch out! If you are trying to filter on suppressed problems, do not put a reference to suppressed in your filter. Instead it goes outside the filter like so: zapi.problem.get(…,suppressed=False,filter={'name':…})
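
A sketch of how that call might be assembled with pyzabbix and token authentication. The URL, token and problem name are placeholders, and the "output" setting is my own addition for illustration. The point is only where suppressed sits relative to filter.

```python
def problem_query(name: str, suppressed: bool = False) -> dict:
    """Build keyword arguments for problem.get.

    `suppressed` is a top-level parameter, not a field inside `filter`.
    """
    return {
        "output": "extend",
        "suppressed": suppressed,
        "filter": {"name": name},
    }


def fetch_problems(url: str, token: str, name: str) -> list:
    """Query a Zabbix server; requires the pyzabbix package (not py-zabbix)."""
    # Imported here so problem_query() can be tried without pyzabbix installed
    from pyzabbix import ZabbixAPI

    zapi = ZabbixAPI(url)
    zapi.login(api_token=token)
    return zapi.problem.get(**problem_query(name))


# Hypothetical problem name, for illustration only
print(problem_query("High CPU utilization"))
```

Calling fetch_problems() needs a reachable Zabbix server and a valid token, so it is not invoked here.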

Monitoring > Problems > History view is slow. Then it grays out periodically. Answer: Zabbix is spending all its time figuring out which host groups you have access to. To speed things up, explicitly enter only your accessible host groups in the filter.

geoMAP in Zabbix 6.0 is cool until you blow up a continent and see all the local geographical names written in their native language. So Asian placenames are inscrutable to English speakers. Is there any fix? Answer: You are probably using a provider, OpenStreetMap in this case, which is using localized names. You can switch providers (global setting).

I’m using a RegEx in the regsub function on an LLD macro. What flavor of RegEx are supported and what characters need to be escaped? Answer: Supposedly Perl-compatible (PCRE) RegExes are supported. For anything remotely complex, enclose your RegEx in double-quotes. Then, for good measure put a backslash (\) in front of any double-quote (“) you require as a match character, and a backslash in front of any slash (/) match character, plus the usual rules.

Why am I seeing the same host graph twice? Answer: This is a bug I have personally discovered in Zabbix 6.0. It occurs when you have a template with just a single item and a single graph. They will be working on it as of August 2022.

In latest data I see: Value of type “string” is not suitable for value type “numeric unsigned.” Why? Answer: I got this in Zabbix 6.4 when I used zabbix_sender with argument -o 36 which I thought would feed in the integer 36. But no, it got interpreted as a string. I tried to introduce a preprocessing step but I could not get it to work. In the end I created a dependent item with a RegEx to convert it. I made the original item type character. I could not beat this in a simple way.

I can’t get my new agent to be seen by its Zabbix proxy. Error is failed to accept an incoming connection: from [agent]: reading first byte from connection failed: [104] Connection reset. Answer: Are you perhaps running a Palo Alto firewall? They will permit the TCP handshake and then drop the connection with a “reset both sides,” which produces this error. Thus super simplified connection tests you run by hand with nc/nmap may appear to work.

Does changing the name of a host change its hostid? Answer: No. We have a multi-stage discovery process which relies on this fact.

Does the hosts IP filter accept a subnet mask? Answer: No, it is very primitive. It does accept a partial IP, strangely enough, so 10.9.9 matches 10.9.9.0/24.

A word about SSH checks and triggers
Through the school of hard knocks I have learned that my ssh check is clipping the output from the executed command. So you know that partial data you see when you look at latest data, and thought it was truncating it for display purposes? Nuh, ah. That’s all you’re getting to go up against in your trigger, which sucks. It’s something like 260 characters. I got lucky in a sense to discover this early by running an ssh check against dns resolution of amazon.com. The response I got varied almost every 60 seconds depending on whether or not the response came out of the dns cache. So this was an excellent testbed to learn about the flakiness of triggers as well as waste an entire day.

Another thing about triggers with a regex. As far as I can tell the logic is reversed. So you think you’re defining the OK condition when you seek to match the output and have it given the value of 1. But instead try to match the desired output for the OK condition, but assign it a value of 0. I guess. Only that approach seems to work for me. And getting the regex to treat multiple lines as a unit was also a little tricky. I think by default it favored testing only against the last line.

So let’s say my output as scraped from Monitoring|Latest Data alternated between either

proxy1>test dns amazon.com
Performing DNS lookup for: amazon.com
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  205.251.242.103
  176.32.98.166
  176.32.103.205
Cache TTL: 1, cache HIT
DNS Resolver Response: Success

or

proxy1>test dns amazon.com
Performing DNS lookup for: amazon.com
 
Sending A query for amazon.com to 192.168.135.145.
 
Sending A query for amazon.com to 8.8.8.8.
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  20

, then here is my iregexp expression which seems to do the correct thing (treat both of these outcomes as successes):

{proxy1:ssh.run[resolve DNS,1.2.3.4,22,utf-8].iregexp("(?s)((205\.251\.|176\.32\.)|Sending A query.+\s20)")}=0

Note that the (?s) at the beginning helps, I think, to treat the newline character as just another character which matches “.”. I may have an extra set of parentheses around the outermost alternating expression, but I can only experiment so much…
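
The effect of (?s) can be checked outside Zabbix. This sketch uses Python's re module, which is not the same engine Zabbix uses, so treat it as an approximation. The two sample outputs are shortened copies of the ones above. The "failure" text is invented for the test.

```python
import re

PATTERN = r"(?s)((205\.251\.|176\.32\.)|Sending A query.+\s20)"

cache_hit = """Performing DNS lookup for: amazon.com

DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  205.251.242.103
  176.32.98.166
Cache TTL: 1, cache HIT
"""

# Clipped output: the addresses are cut off after "20"
cache_miss = """Performing DNS lookup for: amazon.com

Sending A query for amazon.com to 192.168.135.145.

Sending A query for amazon.com to 8.8.8.8.

DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  20"""

# Invented failure text, only to have something that should not match
failure = "Performing DNS lookup for: amazon.com\n\nDNS Resolver Response: Failure\n"


def matches(pattern: str, text: str) -> bool:
    """Case-insensitive search, mimicking iregexp."""
    return re.search(pattern, text, re.IGNORECASE) is not None


print(matches(PATTERN, cache_hit))    # True
print(matches(PATTERN, cache_miss))   # True
print(matches(PATTERN, failure))      # False
# Same pattern without (?s): "." stops at newlines, so the clipped case fails
print(matches(PATTERN.replace("(?s)", ""), cache_miss))  # False
```

The last line is the interesting one: without (?s) the ".+" cannot cross the line breaks between "Sending A query" and the clipped "20".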

I ran various tests such as to change just one of the numbers to make sure it triggered.

I now think I will get better, i.e., more complete, results if I make the item of type text rather than character, at least that switch definitely helped with another truncated output I was getting from another ssh check. So, yes, now I am capturing all the output. So, note to self, use type text unless you have really brief output from your ssh check.

So with all that gained knowledge, my simplified expression now reads like this:

{proxy:ssh.run[resolve a dns name,1.2.3.4,22,utf-8].iregexp("(205\.251\.|176\.32\.)")}=0

Here’s a CPU trigger. From a show status it focuses on the line:

CPU utilization: 29%

and so if I want to trigger a problem for 95% or higher CPU, this expression works for me:

CPU utilization:\s+([ 1-8]\d|9[0-4])\%
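
A quick way to sanity-check which readings this expression accepts, again using Python's re as a stand-in for the Zabbix engine. The test lines are made up apart from the 29% example.

```python
import re

# The expression from above: matches readings considered OK (below 95%)
OK_CPU = re.compile(r"CPU utilization:\s+([ 1-8]\d|9[0-4])\%")


def is_ok(line: str) -> bool:
    return OK_CPU.search(line) is not None


print(is_ok("CPU utilization: 29%"))   # True
print(is_ok("CPU utilization: 94%"))   # True
print(is_ok("CPU utilization: 95%"))   # False
print(is_ok("CPU utilization: 100%"))  # False
print(is_ok("CPU utilization:  5%"))   # True  (two spaces before the digit)
print(is_ok("CPU utilization: 5%"))    # False (one space, single digit)
```

Note the last case: a single-digit value only matches if the device pads it with an extra space, which is worth verifying against real output.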

A nice online regular expression checker is https://regexr.com/

And a very simple PING test ssh check item, where the expected resulting line will be:

5 packets transmitted, 5 packets received, 0% packet loss

– for that I used the item wizard, altered what it came up with, and arrived at this:

(({proxy:ssh.run[ping 8.8.8.8,1.2.3.4,22,utf-8].iregexp("[45] packets received")})=0)

So I will accept the results as OK as long as at most one of five packets was dropped.

A lesson learned from SNMP monitoring of F5 devices
My F5 BigIP devices began producing problems as soon as we set up the SNMP monitoring. Something like this:

Node /Common/drj-10_1_2_3 is not available in some capacity: blue (4)

It never seemed to matter until now that my nodes appear blue. But perhaps SNMP is enforcing a best practice and expecting nodes to not be blue, meaning to be monitored. And it turns out you can set up a default monitor for your nodes (I use gateway_icmp). It’s found in Nodes | Default Monitor. I’m not sure why this is not better documented by F5. After this, many legacy nodes turn red so I am cleaning them up… But my conclusion is that I have learned something about my own systems from the act of implementing this monitoring, and that’s a good thing.

To be continued…

References and related
A good commercial solution for infrastructure monitoring: Microfocus SiteScope.

DIY monitoring

The Zabbix manual

Direct link to Zabbix Repos (RPMs), including standalone RPMs for zabbix_sender, zabbix_get and zabbix_js: https://repo.zabbix.com/zabbix/5.0/rhel/8/x86_64/

A nice online regular expression (RegEx) checker is: https://RegEx101.com/.

Another online regular expression checker is: https://RegExr.com/.

Just to put it out there: If you like Zabbix you may also like Specto. Specto is an open-source tool for monitoring web sites (“synthetic” monitoring). I know one major organization which uses it so it can’t be too bad. https://specto.sourceforge.net/

Since this document is such a mess I’m starting to document some of my interesting items and Practical Zabbix examples in this newer and cleaner post. It includes the baseline calculation formula.


Turning HP SiteScope into SiteScope Classic with Perl

Intro
HP SiteScope is a terrific web application tool and not too expensive for those who have any kind of a budget. The built-in monitor types are a bit limited, but since it allows calls to user-provided scripts your imagination is the only real limitation. For those with too many responsibilities and too little time on their hands it is a real productivity enhancer.

I’ve been using the product for 12 years now – since it was Freshwater SiteScope. I still have misgivings about the interface change introduced some years ago when it was part of Mercury. It went from simple and reliable to Java, complicated and flaky. To this day I have to re-start a SiteScope screen in my browser on a daily basis as the browser cannot recover from a server restart or who knows what other failures.

So I longed for the days of SiteScope Classic. We kept it running for as long as possible, years in fact. But at some point there were no more releases created for the classic view. So I investigated the feasibility of creating my own conversion tool. And…partially succeeded. Succeeded to the point where I can pull up the web page on my Blackberry and get the statuses and history. Think you can do that with regular HP SiteScope? I can’t. Maybe there’s an upgrade for it, but still. It’s nice to have the classic interface when you want to pull up the statuses as quickly as possible, regardless of the Blackberry display issue.

Looking back at my code, I obviously decided to try my hand at OO (object oriented) programming in Perl, with mixed results. Perl’s OO syntax isn’t the best, which addles comprehension. Without further ado, let’s jump into it.

The Details
It relies on something I noticed, that this URL on your HP SiteScope server, http://localhost:8080/SiteScope/services/APIConfigurationImpl?method=getConfigurationSnapshot, contains a tree of relationships of all the monitors. Cool, right? But it’s not a tree like you or I would design. Between parent and child is an intermediate layer. I suppose you need that because a group can contain monitors (my only focus in this exercise), but it can also contain alerts and maybe some other properties as well. So I guess the intermediate layer gives them the flexibility to represent all that, though it certainly added to my complication in parsing it. That’s why you’ll see the concern over “grandkids.” I developed a recursive, web-enabled Perl program to parse through this xml. That gives me the tools to build the nice hierarchical groupings. But it does not give me the statuses.

For the status of each monitor I wrote a separate scraper script that simply reads the entire daily SiteScope log every minute! Crude, but it works. I use it for an installation with hundreds of monitors and a log file that grows to 9 MB by the end of the day so I know it scales to that size. Beyond that it’s untested.

In addition to giving only the relationships, the xml also changes with every invocation. It attaches ID numbers to the monitors which initially you think is a nice unique identifier, but they change from invocation to invocation! So an additional challenge was to match up the names of the monitors in the xml output to the names as recorded in the SiteScope log. Also a bit tricky, but in general doable.

So without further ado, here’s the source code for the xml parser and main program which gets called from the web:

#!/usr/bin/perl
# Copyright work under the Artistic License, http://www.opensource.org/licenses/Artistic-2.0
# build v.simple SiteScope web GUI appropriate for smartphones
# 7/2010
#
# Id is our package which defines the Id class
use Id;
use CGI::Pretty;
my $cgi=new CGI;
$DEBUG = 0;
# GIF location on SiteScope classic
$ssgifs = "/artwork/";
$health{good} = qq(<img src="${ssgifs}okay.gif">);
$health{error} = qq(<img src="${ssgifs}error.gif">);
$health{warning} = qq(<img src="${ssgifs}warning.gif">);
# report CGI
$rprt = "/SS/rprt";
# the frustrating thing is that this xml output changes almost every time you call it
$url = 'http://localhost:8080/SiteScope/services/APIConfigurationImpl?method=getConfigurationSnapshot';
# get current health of all monitors - which is scraped from the log every minute by a hilgarj cron job
$monitorstats = "/tmp/monitorstats.txt";
print "Content-type: text/plain\n\n" if $DEBUG;
open(MONITORSTATS,"$monitorstats") || die "Cannot open monitor stats file $monitorstats!!";
while(<MONITORSTATS>) {
  chomp;
  ($monitor,$status,$value) = /([^\t]+)\t([^\t]+)\t([^\t]+)/;
  $monitors{"$monitor"} = $status;
  $monitorv{"$monitor"} = $value;
}
open(CURL,"curl $url 2>/dev/null|") || die "cannot open $url for reading!!\n";
my %myobjs = ();
# the xml is one long line!
@lines = <CURL>;
#print "xml line: $lines[0]\n" if $DEBUG;
@multiRefs = split "<multiRef",$lines[0];
#parse multiRefs
# create top-level object
my $id = Id->new (
      id => "id0");
# hash of this object with id as key
$myobjs{"id0"} = $id;
 
# first build our objects...
foreach $mref (@multiRefs) {
  next unless $mref =~ /\sid=/;
#  id="id0" ...
  ($parentid) =  $mref =~ /id=\"(id\d+)/;
  print "parentid: $parentid\n" if $DEBUG;
# watch out for <item><key xsi:type="soapenc:string">groupSnapshotChildren</key><value href="#id3 ...
# vs <item><key xsi:type="soapenc:string">Network</key><value href="#id40"/>
  print "mref: $mref\n" if $DEBUG;
  @ids = split /<item><key/, $mref;
# then loop over ids mentioned in this mref
  foreach $myid (@ids) {
    next unless $myid =~ /href="#(id\d+)/;
    next unless $myobjs{"$parentid"};
# types include group, monitor, alert
    ($typebyregex) = $myid =~ />snapshot_(\w+)SnapshotChildren</;
    $parenttype = $myobjs{"$parentid"}->type();
    $type = $typebyregex ? $typebyregex : $parenttype;
    print "type: $type\n" if $DEBUG;
# skip alert definitions
    next if $type eq "alert";
    print "myid: $myid\n" if $DEBUG;
    ($actualid) = $myid =~ /href="#(id\d+)/;
    print "actualid: $actualid\n" if $DEBUG;
# construct object
    my $id = Id->new (
      id => $actualid,
      type => $type,
      parentid => $parentid );
# build hash of these objects with actualid as key
    $myobjs{$actualid} = $id;
# addchild to parent. note that parent should already have been encountered
    $myobjs{"$parentid"}->addchild($actualid);
    if ($myid !~ /groupSnapshotChildren/) {
# interesting child - has name (every other generation has no name!)
      ($name) = $myid =~ /string\">(.+?)<\/key/;  # use non-greedy operator
      print "name: $name\n" if $DEBUG;
# some names are not of interest to us: alerts, which end in "error" or "good"
      if ($name !~ /(error|good)$/) {
# name may not be unique - get extended name which include all parents
        if (defined $myobjs{"$parentid"}->parentid()) {
          $gdparid = $myobjs{"$parentid"}->parentid();
          $gdparname = $myobjs{"$gdparid"}->extname();
# extname -> extended, or distinguished name.  Should be unique
          $extname = $gdparname. '/' . $name;
        } else {
# 1st generation
          print "1st generation\n" if $DEBUG;
          $extname = $name;
        }
        print "extname: $extname\n" if $DEBUG;
        $id->name($name);
        $id->extname($extname);
        $id->isanamedid(1);
        $myobjs{"$parentid"}->hasnamedkids(1); # want to mark its parent as "special"
# we also need our hash to reference objects by extended name since id changes with each extract and name may not be unique
        $myobjs{"$extname"} = $id;
      } # end conditional over desirable name check
    } else {
      $id->isanamedid(0);
    }
  }
}
#
# now it's all parsed and our objects are alive. Let's build a web site!
#
# build a cookie containing path
my $pi = $ENV{PATH_INFO};
$script = $ENV{SCRIPT_NAME};
$ua = $ENV{HTTP_USER_AGENT};
# Blackberry browser test
$BB = $ua =~ /^BlackBerry/i ? 1 : 0;
$MSIE = $ua =~ /MSIE /;
# font-size depends on browser
$FS = "font-size: x-small;" if $MSIE;
$cookie = $cgi->cookie("pathinfo");
$uri = $script . $pi;
$cookie=$cgi->cookie(-name=>"pathinfo", -value=>"$uri");
print $cgi->header(-type=>"text/html",-cookie=>$cookie);
($url) = $pi =~ m#([^/]+)$#;
#  -title=>'SmartPhone View',
# this doesn't work, sigh...
#print $cgi->start_html(-head=>meta({-http_equiv=>'Refresh'}));
print qq( <HEAD>
<meta http-equiv="Expires" content="0">
<meta http-equiv="Pragma" content="no-cache">
<meta HTTP-EQUIV="Refresh" CONTENT="60; URL=$url">
<TITLE>SiteScope Classic $url Detail</TITLE>
<style type="text/css">
a.good {color: green; }
a.warning {color: green; }
a.error {color: red; }
td {font-family: Arial, Helvetica, sans-serif; $FS}
p.ss {font-family: Arial, Helvetica, sans-serif;}
</style>
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<script type=text/javascript>
function changeme(elemid,longvalue)
{
document.getElementById(elemid).innerText=longvalue;
}
function restoreme(elemid,truncvalue)
{
document.getElementById(elemid).innerText=truncvalue;
}
</script>
</HEAD><body>
);
 
#print $cgi->h1("This is the heading");
# parse path
# top lvl name:2nd lvl name:3rd lvl name
$altpi = $cgi->path_info();
print $cgi->p("pi is $pi") if $DEBUG;
#print $cgi->p("altpi is $altpi");
# relative url
$rurl = $cgi->url(-relative=>1);
if ($pi eq "") {
# the top
# top id is id3
  print qq(<p class="ss">);
  $myid = "id3";
  foreach $kid ($myobjs{"$myid"}->get_children()) {
    my $kidname = $myobjs{"$kid"}->name();
# kids can be subgroups or standalone monitors
    my $health = recurse("/$kidname");
    print "$health{$health} <a href=\"$rurl/$kidname\">$kidname</a><br>\n";
    $prodtest = $kid if $kidname eq "Production";
  }
  print "</p>\n";
} else {
  $extname = $pi;
  print "pi,name,extname,script: $pi,$name,$extname,$script\n" if $DEBUG;
# print where we are
  $uriname = $pi;
  $uriname =~ s#^/##;
  #print $cgi->p("name is $name");
  #print $cgi->p("uriname is $uriname");
  $uricompositepart = "/";
  @uriparts = split('/',$uriname);
  $lastpart = pop @uriparts;
  print qq(<p class="ss"><a href="$script"><b>Sitescope</b></a><br>);
  print qq(<b>Monitors in: );
  foreach $uripart (@uriparts) {
    my $healthp = recurse("$uricompositepart$uripart");
# build valid link
    ##$link = qq(<a class="good" href="$script$uricompositepart$uripart">$uripart</a>: );
    $link = qq(<a class="$healthp" href="$script$uricompositepart$uripart">$uripart</a>: );
    $uricompositepart .= "$uripart/";
    print $link;
  }
  my $healthp = recurse("$uricompositepart$lastpart");
  $color = $healthp eq "error" ? "red" : "green";
  print qq(<font color="$color">$lastpart</font></b></p>\n);
  print qq(<table border="1" cellspacing="0">);
  #print qq(<table>);
  %hashtrs = ();
  foreach $kid ($myobjs{"$extname"}->get_children()) {
    print "kid id: " . $myobjs{"$kid"}->id() . "\n" if $DEBUG;
    next unless $myobjs{"$kid"}->hasnamedkids();
    foreach $gdkid ($myobjs{"$kid"}->get_children()) {
      print "gdkid id: " . $myobjs{"$gdkid"}->id() . "\n" if $DEBUG;
      $gdkidname = $myobjs{"$gdkid"}->name();
      $gdkidextname = $myobjs{"$gdkid"}->extname();
      my $health = recurse("$gdkidextname");
      my $type = $myobjs{"$gdkid"}->type();
# dig deeper to learn health of the grandkid's grandkids
      $objct = $healthct{good} = $healthct{error} = $healthct{warning} = 0;
      foreach $ggkid ($myobjs{"$gdkidextname"}->get_children()) {
        print "ggkid id: " . $myobjs{"$ggkid"}->id() . "\n" if $DEBUG;
        next unless $myobjs{"$ggkid"}->hasnamedkids();
        foreach $gggdkid ($myobjs{"$ggkid"}->get_children()) {
          print "gggdkid id: " . $myobjs{"$gggdkid"}->id() . "\n" if $DEBUG;
          $gggdkidname = $myobjs{"$gggdkid"}->name();
          $gggdkidextname = $myobjs{"$gggdkid"}->extname();
          my $health = recurse("$gggdkidextname");
          $objct++;
          $healthct{$health}++;
        }
      }
      $elemct++;
      $elemid = "elemid" . $elemct;
# groups should have distinctive cell background color to set them apart from monitors
      if ($type eq "group") {
        $bgcolor = "#F0F0F0";
        $celllink = "$lastpart/$gdkidname";
        $truncvalue = qq(<font color="red">$healthct{error}</font>/$objct);
        $tdval = $truncvalue;
      } else {
        $bgcolor = "#FFFFFF";
        $celllink = "$rprt?$gdkidname";
# truncate monitor value to save display space
        $longvalue = $monitorv{"$gdkidname"};
        (my $truncvalue) = $monitorv{"$gdkidname"} =~ /^(.{7,9})/;
        $truncvalue = $truncvalue? $truncvalue : "&nbsp;";
        $tdval = qq(<span id="$elemid" onmouseover="changeme('$elemid','$longvalue')" onmouseout="restoreme('$elemid','$truncvalue')">$truncvalue</span>);
      }
      $hashtrs{"$gdkidname"} = qq(<tr><td bgcolor="#000000">$health{$health} </td><td>$tdval</td><td bgcolor="$bgcolor"><a href="$celllink">$gdkidname</a></td></tr>\n);
# for health we're going to have to recurse
    }
  }
# print out in alphabetical order
  foreach $key (sort(keys %hashtrs)) {
    print $hashtrs{"$key"};
  }
  print "</table>";
}
print $cgi->end_html();
#######################################
sub recurse {
# to get the worst health among all descendants
my $moniext = shift;
my ($moni) = $moniext =~ m#/([^/]+)$#;
# don't bother recursing and all that unless we have to...
return $myobjs{"$moniext"}->health() if defined $myobjs{"$moniext"}->health();
print "moni,moniext: $moni, $moniext\n" if $DEBUG;
my ($kid,$gdkidextname,$health,$cumhealth);
$cumhealth = $health = $monitors{"$moni"} ? $monitors{"$moni"} : "good";
foreach $kid ($myobjs{"$moniext"}->get_children()) {
    if ($myobjs{"$kid"}->hasnamedkids()) {
      foreach $gdkid ($myobjs{"$kid"}->get_children()) {
        $gdkidextname = $myobjs{"$gdkid"}->extname();
# for health we're going to have to recurse
        $health = recurse("$gdkidextname");
        if ($health eq "error" || $cumhealth eq "error") {
          $cumhealth = "error";
        } elsif ($health eq "warning" || $cumhealth eq "warning") {
          $cumhealth = "warning";
        }
      }
    } else {
# this kid is end of line
      $health = $monitors{"$kid"} ? $monitors{"$kid"} : "good";
        if ($health eq "error" || $cumhealth eq "error") {
          $cumhealth = "error";
        } elsif ($health eq "warning" || $cumhealth eq "warning") {
          $cumhealth = "warning";
        }
    }
}
$myobjs{"$moniext"}->health("$cumhealth");
return $cumhealth;
} # end sub recurse
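
The roll-up idea behind recurse can be tried in isolation. What follows is only a minimal sketch, not the code ss runs: the %rank table, the worse and rollup helpers and the little %tree are all hypothetical names and data made up for illustration. Only the three status strings (good, warning, error) come from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical severity ranking for the status strings ss uses.
my %rank = (good => 0, warning => 1, error => 2);

# Return whichever of two health strings is more severe.
sub worse {
    my ($x, $y) = @_;
    return $rank{$x} >= $rank{$y} ? $x : $y;
}

# Hypothetical tree: a group maps to an array ref of children,
# a monitor maps to its own health string.
my %tree = (
    '/Production'            => ['/Production/DNS', '/Production/Proxy'],
    '/Production/DNS'        => 'good',
    '/Production/Proxy'      => ['/Production/Proxy/pac', '/Production/Proxy/disk'],
    '/Production/Proxy/pac'  => 'warning',
    '/Production/Proxy/disk' => 'good',
);

# Roll the health of every descendant up into one status.
sub rollup {
    my ($node) = @_;
    my $entry = $tree{$node};
    return $entry unless ref $entry;    # leaf monitor: its own health
    my $cum = 'good';
    $cum = worse($cum, rollup($_)) for @$entry;
    return $cum;
}

print rollup('/Production'), "\n";    # prints "warning"
```

A ranking table like this might replace the repeated if/elsif comparisons, at the cost of one more hash to maintain. Unlike recurse, the sketch does not cache the result on the object.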

I call the script simply “ss” to minimize the typing required. As you can see, it uses a package called Id.pm, which I wrote to encapsulate the class and its methods. Here is Id.pm:

package Id;
# Copyright work under the Artistic License, http://www.opensource.org/licenses/Artistic-2.0
# class for storing data about an id
# URL (not currently protected): http://localhost:8080/SiteScope/services/APIConfigurationImpl?method=getConfigurationSnapshot
# class for storing data about a group
use warnings;
use strict;
use Carp;
#group methods
# constructor
# get_members
# get_name
# get_id
# addmember
#
# member methods
# constructor
# get_id
# get_name
# get_type
# get_gp
# set_gp
 
sub new {
  my $class = shift;
  my $self = {@_};
  bless($self, "Id");
  return $self;
}
# get-set methods, p. 355
sub parentid { $_[0]->{parentid}=$_[1] if defined $_[1]; $_[0]->{parentid} }
sub isanamedid { $_[0]->{isanamedid}=$_[1] if defined $_[1]; $_[0]->{isanamedid} }
sub id { $_[0]->{id}=$_[1] if defined $_[1]; $_[0]->{id} }
sub name { $_[0]->{name}=$_[1] if defined $_[1]; $_[0]->{name} }
sub extname { $_[0]->{extname}=$_[1] if defined $_[1]; $_[0]->{extname} }
sub type { $_[0]->{type}=$_[1] if defined $_[1]; $_[0]->{type} }
sub health { $_[0]->{health}=$_[1] if defined $_[1]; $_[0]->{health} }
sub hasnamedkids { $_[0]->{hasnamedkids}=$_[1] if defined $_[1]; $_[0]->{hasnamedkids} }
 
# get children - use anonymous array, book p. 221-222
sub get_children {
# return empty list if the array hasn't been defined...
# (defined @array is deprecated and a fatal error in newer Perls, so test the reference instead)
  $_[0]->{children} ? @{$_[0]->{children}} : ();
}
# adding children
sub addchild {
  $_[0]->{children} = [] unless defined  $_[0]->{children};
  push @{$_[0]->{children}},$_[1];
}
 
1;
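
To give a feel for how a class like this might be used, here is a small sketch. It inlines a trimmed copy of the class so that it runs on its own; ss itself would load the real Id.pm. The ids and names (id10, id11, Test) are made up for illustration, and storing children as ids that are looked up again in %myobjs is my reading of how ss walks the tree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Trimmed inline copy of the Id class so this sketch is self-contained.
{
    package Id;
    sub new  { my $class = shift; return bless {@_}, $class }
    sub name { $_[0]->{name} = $_[1] if defined $_[1]; $_[0]->{name} }
    sub type { $_[0]->{type} = $_[1] if defined $_[1]; $_[0]->{type} }
    sub get_children { $_[0]->{children} ? @{ $_[0]->{children} } : () }
    sub addchild { push @{ $_[0]->{children} }, $_[1] }
}

# Hypothetical objects, keyed by id as in ss.
my %myobjs;
$myobjs{id3}  = Id->new(id => 'id3',  name => 'SiteScope',  type => 'group');
$myobjs{id10} = Id->new(id => 'id10', name => 'Production', type => 'group');
$myobjs{id11} = Id->new(id => 'id11', name => 'Test',       type => 'group');

# Children are stored as ids, which are then looked up in %myobjs.
$myobjs{id3}->addchild($_) for qw(id10 id11);

for my $kid ($myobjs{id3}->get_children()) {
    print $myobjs{$kid}->name(), "\n";
}
```

Run as is, this prints Production and Test, one per line. An object with no children simply returns an empty list from get_children, so the foreach loops in ss need no special case for leaf monitors.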

ss also assumes the existence of just a few of the images from SiteScope Classic: the green circle for good, the red diamond for error, the yellow one for warning, and so on. I borrowed them from SiteScope Classic.

Here is the code for the log scraper:

#!/usr/bin/perl
# analyze SiteScope log file
# Copyright work under the Artistic License, http://www.opensource.org/licenses/Artistic-2.0
# 8/2010
$DEBUG = 0;
$logdir = "/opt/SiteScope/logs";
$monitorstats = "/tmp/monitorstats.txt";
$monitorstatshis = "/tmp/monitorstats-his.txt";
$date = `date +%Y_%m_%d`;
chomp($date);
$file = "$logdir/SiteScope$date.log";
open(LOG,"$file") || die "Cannot open SiteScope log file: $file!!\n";
# example lines:
# 16:51:07 08/02/2010     good    LDAPServers     LDAP SSL test : ldapsrv.drj.com exit: 0, 0.502 sec    1:3481  0       502
#16:51:22 08/02/2010     good    Network DNS: (AMEAST) ns2  0.033 sec   2:3459      200     33      ok
#16:51:49 08/02/2010     good    Proxy   proxy.pac script on iwww    0.055 sec   2:12467 200     55   ok     4288    1280782309      0    0  55      0       0      200  0
#16:52:04 08/02/2010     good    Proxy   Disk Space: earth /logs   66% full, 13862MB free, 41921MB total 3:3598      66      139862
#16:52:09 08/02/2010     good    DrjExtranet  URL: wwwsecure.drj.com     0.364 sec    1:3604      200 364  ok 26125   1280782328     0    0   358     4       2       200  0
while(<LOG>) {
  ($time,$date,$status,$group,$monitor,$value) = /(\S+)\s(\S+)\t(\S+)\t(\S+)\t([^\t]+)\t([^\t]+)/;
  print "time,date,status,group,monitor,value: $time,$date,$status,$group,$monitor,$value\n" if $DEBUG;
  next if $group =~ /__health__/; # don't care about these lines
  $mons{"$monitor"} = 1;
  push @{$mont{"$monitor"}} , $time;
  push @{$mond{"$monitor"}} , $date;
  push @{$monh{"$monitor"}} , $status;
  push @{$monv{"$monitor"}} , $value;
}
# open output at last moment to minimize chances of reading while locked for writing
open(MONITORSTATS,">$monitorstats") || die "Cannot open monitor stats file $monitorstats!!\n";
open(MONITORSTATSHIS,">$monitorstatshis") || die "Cannot open monitor stats file $monitorstatshis!!\n";
# write it all out - will always print the latest values
foreach $monitor (keys %mons) {
# dereference our anonymous arrays
  @times = @{$mont{"$monitor"}};
  @dates = @{$mond{"$monitor"}};
  @status = @{$monh{"$monitor"}};
  @value = @{$monv{"$monitor"}};
# last element is the latest measured status and value
  print MONITORSTATS "$monitor\t$status[-1]\t$value[-1]\n";
  print MONITORSTATSHIS "$monitor\n";
  #for ($i=-11;$i<0;$i++) {
# put latest measure on top
  for ($i=-1;$i>-13;$i--) {
    $time = defined $times[$i] ? $times[$i] : "NA";
    $date = defined $dates[$i] ? $dates[$i] : "NA";
    $stat = defined $status[$i] ? $status[$i] : "NA";
    $val = defined $value[$i] ? $value[$i] : "NA";
    print MONITORSTATSHIS "\t$time\t$date\t$stat\t$val\n";
  }
}

As I said it gets called every minute by cron.
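
If you want to check the parsing step by itself, something like the following sketch may help. It wraps the scraper's regular expression in a hypothetical parse_line helper (not part of the script above) and feeds it one sample line modeled on the DNS example in the comments. The tab positions in that sample are an assumption on my part, since the examples above do not show tabs clearly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one SiteScope log line into its first six fields, using the
# same regular expression as the scraper.
# Returns an empty list when the line does not match.
sub parse_line {
    my ($line) = @_;
    return $line =~ /(\S+)\s(\S+)\t(\S+)\t(\S+)\t([^\t]+)\t([^\t]+)/;
}

# Sample line with tabs written explicitly (field layout is assumed).
my $line = join("\t",
    "16:51:22 08/02/2010", "good", "Network",
    "DNS: (AMEAST) ns2", "0.033 sec", "2:3459", "200", "33", "ok");

my ($time, $date, $status, $group, $monitor, $value) = parse_line($line);
print "$monitor is $status ($value) at $time on $date\n";
```

Because the helper returns an empty list on a mismatch, a caller could skip malformed lines with a simple test on the number of fields returned.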

That’s it! I enter the URL sitescope.drj.com/SS/ss to access the main program, which gets executed because I made /SS a CGI-BIN directory.

This gives you a read-only, Java-free view into your SiteScope status and hierarchy which harks back to the good old days of Freshwater SiteScope.

Know your limits
What it does not do, unfortunately, is allow you to run a monitor. That seemed like the next simplest thing, and I should have been able to do it, but I couldn’t figure it out. Still less does it let you define new monitors (never going to happen) or alerts.

I use this successfully against my HP SiteScope instance of roughly 400 monitors which itself is on a VM and there is no apparent strain. At some point this simple-minded script would no longer scale to suit the task at hand, but it might be good for up to a few thousand monitors.

And now a word about open source alternatives
Since I was so enamored with SiteScope Classic there seemed to be no compelling reason to shell out the dough for HP SiteScope with its unwanted interface, so I briefly looked around at free alternatives. Free sounds good, right? Not so much in practice. Out there in Cyberspace there is an enthusiast for a product called Zabbix. I just want to go on the record that Zabbix is the most confused piece of junk I have run across. You are getting less than what you paid for ($0) because you will be wasting a lot of time with it, and in the end it isn’t all that capable. Nagios also had its limits – I can’t remember the exact reason I didn’t go down that route, but there were definite reasons.

HP SiteScope is no panacea. “HP” and “stifling bureaucracy” need to be mentioned in the same sentence. Every time we renew support it is the most confusing mess of line items. Every time there’s a new cast of characters over at HP who know nothing about the account’s history. You practically have to beg them to accept your money for a low-budget item like SiteScope because they really don’t pursue it in any way. Then their SAID and contract numbers are confusing if you only see them once every few years.

Conclusion
A conversion program does exist for turning the finicky HP SiteScope Java-encumbered view into pure SiteScope Classic because I wrote it! But it’s a limited read-only view. Still, it’s helpful in a pinch and can even be viewed on the Blackberry’s browser.

Another problem is that HP has threatened to completely change the API so this tool, which is designed for HP SiteScope v 10.12, will probably completely break for newer versions. Oh, well.

References
This post shows some silly mistakes to avoid when doing a minor upgrade in version 11.