Categories
Visibility

Everything I need to know about Influxdb and Grafana

Intro

Some of my colleagues had used Influxdb and Grafana at their previous job so they thought it might fit for what we’re doing in the Visibility team. It sounded good in theory, anyway, so I had to agree. There were a lot of pitfalls. Eventually I got it to the point where I’m satisfied with my accomplishments and want to document the hurdles I’ve overcome.

So as time permits I will be fleshing this out.

Grafana

I’m going to lead with the picture and then the explanation makes a lot more sense.

Single vedge heat map

I’ve spent the bulk of my time wrestling with Grafana. Actually it looks like a pretty capable tool. It’s mostly just understanding how to make it do what you are dreaming about. Our installed version currently is 9.2.1.

My goal is to make a heatmap. But a special kind similar to what I saw the network provider has. That would namely entail one vedge per row, and one column per hour, hence, 24 columns in total. A vedge is a kind of SD-Wan router. I want to help the networking group look at hundreds of them at a time. So that’s on potential dashboard. It would give a view of a day. Another dashboard would show just one router with the each row representing a day, and the columns again showing an hour. Also a heatmap. The multi-vedge dashboard should link to the individual dashboard, ideally. In the end I pulled it off. I am also responsible for feeding the raw data into Influxdb and hence also for the table design.

Getting a workable table design was really imporant. I tried to design it in a vacuum, but that only partially worked. So I revised, adding tags and fields as I felt I needed to, while being mindful of not blowing up the cardinality. I am now using these two tables, sorry, measurements.

vedge measurement
vedge_stats measurement

Although there are hundreds of vedges, some of my tags are redundant, so don’t get overly worried about my high cardinality. UTChour is yes a total kludge – not the “right” way to do things. But I’m still learning and it was simpler in my mind. item in the first measurement is redundant with itemid. But it is more user-friendly: a human-readable name.

Influx Query Explorer

It’s very helpful to use the Explorer, but the synatx there is not exactly the same as it will be when you define template variables. Go figure.

Multiple vedges for the last day

So how did I do it in the end?

Mastering template variables is really key here. I have a drop-down selection for region. In Grafana-world it is a custom variable with potential values EU,NA,SA,AP. That’s pretty easy. I also have a threshold variable, with possible values: 0,20,40,60,80,90,95. And a math variable with values n95,avg,max. More recently I’ve added a threshold_max and a math_days variable.

It gets more interesting however, I promise. I have a category variable which is of type query:

from(bucket: "poc_bucket2")
|> range (start: -1d)
|> filter(fn:(r) => r._measurement == "vedge_stats")
|> group()
|> distinct(column: "category")

Multi-value and Include all options are checked. Just to make it meaningful, category is assigned by the WAN provider and has values such as Gold, Silver, Bronze.

And it gets still more interesting because the last variable depends on the earlier ones, hence we are using chained variables. The last variable, item, is defined thusly:

from(bucket: "poc_bucket2")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stats" and r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.category, set: ${category:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold} and r._value <= ${threshold_max})
|> group()
|> distinct(column: "item")

So what it is designed to do is to generate a list of all the items, which in reality are particular interfaces of the vedges, into a drop-down list.

Note that I want the user to be able to select multiple categories. It’s not well-documented how to chain such a variable, so note the use of contains and set in that one filter function.

And note the double-quotes around ${Region}, another chained variable. You need those double-quotes! It kind of threw me because in Explorer I believe you may not need them.

And all that would be simply nice if we didn’t somehow incorporate these template variables into our panels. I use the Stat visualization. So you’ll get one stat per series. That’s why I artifically created a tag UTChour, so I could easily get a unique stat box for each hour.

The stat visualization flux Query

Here it is…

data = from(bucket: "poc_bucket2")
  |> range(start: -24h, stop: now())
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and r.hostname =~ /^${Region}/ and r.item == "${item}"
  )
  |> drop(columns: ["itemid","ltype","hostname"])
data

Note I hae dropped my extra tags and such which I do not wish to appear during a mouseover.

Remember our regions can be one of AP,EU,NA or SA? Well the hostnames assigned to each vedge start with the two letters of its region of location. Hence the regular explression matching works there to restrict consideration to just the vedges in the selected region.

We are almost done.

Making it a heat map

So my measurement has a tag called percent, which is the percent of available bandwidth that is being used. So I created color-based thresholds:

Colorful percent-based thresholds

You can imagine how colorful the dashboard gets as you ratchet up the threshold template variable. So the use of these thresholds is what turns our stat squares into a true heatmap.

Heatmap visualization

I found the actual heatmap visualization useless for my purposes, by the way!

There is also an unsupported heatmap plugin for Grafana which simply doesn’t work. Hence my roll-your-own approach.

Repitition

How do we get a panel row per vedge? The stat visualization has a feature called Repeat Options. So you repeat by variable. The variable selected is item. Remember that item came from our very last template variable. Repeat direction is Vertical.

For calculation I choose mean. Layout orienttion is Vertical.

The visualization title is also variable-driven. It is ${item} .

The panels are long and thin. Like maybe two units high? – one unit for the label (the item) and the one below it for the 24 horizontal stat boxes.

Put it all together and voila, it works and it’s cool and interactive and fast!

Single vedge heatmap data over multiple days

Of course this is very similar to the multiple vedge dashboard. But now we’re drilling down into a single vedge to look at its usage over a period of time, such as the last two weeks.

Flux query
import "date"
b = date.add(d: 1d, to: -${day}d)
data = from(bucket: "poc_bucket2")
  |> range(start: -${day}d, stop: b)
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and
    r.item == "$item"
  )
  |> drop(columns:["itemid","ltype","hostname"])
data
Variables

As before we have a threshold, Region and category variable with category derived from the same flux query shown above. A new variable is day, which is custom and hidden, It has values 1,2,3,4,…,14. I don’t know how to do a loop in flux or I might have opted a more elegant method to specify the last 14 days.

I did the item variable query a little different, but I think it’s mostly an alternate and could have been the same:

from(bucket: "poc_bucket2")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stats" and r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.category, set: ${category:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold} and r._value <= ${threshold_max})
|> group()
|> distinct(column: "item")

Notice the slightly different handling of Region. And those double-quotes are important, as I learned from the school of hard knocks!

The flux query in the panel is of course different. It looks like this:

import "date"
b = date.add(d: 1d, to: -${day}d)
data = from(bucket: "poc_bucket2")
  |> range(start: -${day}d, stop: b)
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and
    r.item == "$item"
  )
  |> drop(columns:["itemid","ltype","hostname"])
data

So we’re doing some date arithmetic so we can get panel strips, one per day. These panels are long and thin, same as before, but I omitted the title since it’s all the same vedge.

The repeat options are repeat by variable day, repeat direction Vertical as in the other dashboard. The visualization is Stat, as in the other dashboard.

And that’s about it! Here the idea is that you play with the independent variables such as Region and threshold, it generates a list of matching vedge interfaces and you pick one from the drop-down list.

Linking the multiple vedge dashboard to the single vedge history dashboard

Of course the more interactive you make these things the cooler it becomes, right? I was excited to be able to link these two dashboards together in a sensible way.

In the panel config you have Data links. I found this link works:

https://drjohns.com:3000/d/1MqpjD24k/single-vedge-usage-history?orgId=1&var-threshold=60&var-math=n95&${item:queryparam}

So to generalize since most of the URL is specific to my implementation, both dashboards utilize the item variable. I basically discovered the URL for a single vedge dashboard and dissected it and parameterized the item, getting the syntax right with a little Internet research.

So the net effect is that when you hover over any of the vedge panels in the multi-vedge dashboard, you can click on that vedge and pull up – in a new tab in my case – the individual vedge usage history. It’s pretty awesome.

Influxdb

Influxdb is a time series database. It takes some getting used to. Here is my cheat sheet which I like to refer to.

  • bucket is named location with retention policy where time-series data is stored.
  • series is a logical grouping of data defined by shared measurement, tag and field.
  • measurement is similar to an SQL database table.
  • tag is similar to indexed columns in an SQL database.
  • field is similar to unindexed columns in an SQL database.
  • point is similar to SQL row.

This is not going to make a lot of sense to anyone who isn’t Dr John. But I’m sure I’ll be referring to this section for a How I did this reminder.

OK. So I wrote a feed_influxdb.py script which runs every 12 minutes in an Azure DevOps pipeline. It extracts the relevant vedge data from Zabbix using the Zabbix api and puts it into my influxdb measurement vedge whose definition I have shown above. I would say the code is fairly generic, except that it relies on the existence of a master file which contains all the relevant static data about the vedges such as their interface names, Zabbix itemids, and their maximum bandwidth (we called it zabbixSpeed). You cuold pretty much deduce the format of this master file by reverse-engineering this script. So anyway here is feed_influxdb.py.

                    

from pyzabbix import ZabbixAPI
import requests, json, sys, os, re
import time,datetime
from time import sleep
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
from modules import aux_modules,influx_modules

# we need to get data out of Zabbix
inventory_file = 'prod.config.visibility_dashboard_reporting.json'
#inventory_file = 'inv-w-bw.json' # this is a modified version of the above and includes Zabbix bandwidth for most ports
# Login Zabbix API - use hidden variable to this pipeline
token_zabbix = os.environ['ZABBIX_AUTH_TOKEN']
url_zabbix = 'https://zabbix.basf.net/'
zapi = ZabbixAPI(url_zabbix)
zapi.login(api_token=token_zabbix)
# Load inventory file
with open(inventory_file) as inventory_file:
    inventory_json = json.load(inventory_file)
# Time range which want to get data (unixtimestamp)
inventory_s = json.dumps(inventory_json)
inventory_d = json.loads(inventory_s)
time_till = int(time.mktime(datetime.datetime.now().timetuple()))
time_from = int(time_till - 780)  # about 12 minutes plus an extra minute to reflect start delay, etc
i=0
max_items = 200
item_l = []
itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname = {},{},{},{}
for SSID in inventory_d:
    print('SSID',SSID)
    hostname_d = inventory_d[SSID]['hostname']
    for vedge_s in hostname_d:
        print('vedge_s',vedge_s,flush=True)
        items_l = hostname_d[vedge_s]
        for item_d in items_l:
            print('item_d',item_d,flush=True)
            itemname = item_d['itemname']
            lineType = item_d['lineType']
            if 'zabbixSpeed' in item_d:
                bandwidth = int(item_d['zabbixSpeed'])
            else:
                bandwidth = 0
            itemid = item_d['itemid']
            if lineType == 'MPLS' or lineType == 'Internet':
                i = i + 1
                itemid_to_vedge[itemid] = vedge_s # we need this b.c. Zabbix only returns itemid
                itemid_to_ltype[itemid] = lineType # This info is nice to see
                itemid_to_bw[itemid] = bandwidth # So we can get percentage used
                itemid_to_itemname[itemid] = itemname # So we can get percentage used
                item_l.append(itemid)
                if i > max_items:
                    print('item_l',item_l,flush=True)
                    params = {'itemids':item_l,'time_from':time_from,'time_till':time_till,'history':0,'limit':500000}
                    print('params',params)
                    res_d = zapi.do_request('history.get',params)
                    #print('res_d',res_d)
                    #exit()
                    print('After call to zapi.do_request')
                    result_l = res_d['result']
                    Pts = aux_modules.zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname)
                    for Pt in Pts:
                        print('Pt',Pt,flush=True)
                        # DEBUGGING!!! Normally call to data_entry is outside of this loop!!
                        #influx_modules.data_entry([Pt])
                    influx_modules.data_entry(Pts)
                    item_l = [] # empty out item list
                    i = 0
                    sleep(0.2)
else:
# we have to deal with leftovers which did not fill the max_items
    if i > 0:
                    print('Remainder section')
                    print('item_l',item_l,flush=True)
                    params = {'itemids':item_l,'time_from':time_from,'time_till':time_till,'history':0,'limit':500000}
                    res_d = zapi.do_request('history.get',params)
                    print('After call to zapi.do_request')
                    result_l = res_d['result']
                    Pts = aux_modules.zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname)
                    for Pt in Pts:
                        # DEBUGGING!!! normally data_entry is called after this loop
                        print('Pt',Pt,flush=True)
                        #influx_modules.data_entry([Pt])
                    influx_modules.data_entry(Pts)
print('All done feeding influxdb!')

I’m not saying it’s great code. I’m only saying that it gets the job done. It relies on a couple auxiliary scripts. Here is aux_modules.py (it may include some packages I need later on).

                    

import re
import time as tm
import numpy as np

def zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname):

# turn Zabbix results into a list of points which can be fed into influxdb
# [{'itemid': '682837', 'clock': '1671036337', 'value': '8.298851463718859E+005', 'ns': '199631779'},

    Pts = []
    for datapt_d in result_l:
        itemid = datapt_d['itemid']
        time = datapt_d['clock']
        value_s = datapt_d['value']
        value = float(value_s) # we are getting a floating point represented as a string. Convert back to float
        hostname = itemid_to_vedge[itemid]
        lineType = itemid_to_ltype[itemid]
        itemname = itemid_to_itemname[itemid]
# item is a hybrid tag, like a primary tag key
        iface_dir = re.sub(r'(\S+) interface (\S+) .+',r'\1_\2',itemname)
        item = hostname + '_' + lineType + '_' + iface_dir
        if itemid in itemid_to_bw:
            bw_s = itemid_to_bw[itemid]
            bw = int(bw_s)
            if bw == 0:
                percent = 0
            else:
                percent = int(100*value/bw)
        else:
            percent = 0
        tags = [{'tag':'hostname','value':hostname},{'tag':'itemid','value':itemid},{'tag':'ltype','value':lineType},{'tag':'item','value':item}]
        fields = [value,percent]
        Pt = {'measurement':'vedge','tags':tags,'fields':fields,'time':time}
        Pts.append(Pt)
    return Pts

Next I’ll show influx_modules.py.

                    

import influxdb_client, os, time
from datetime import datetime, timezone
import pytz
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
import random

def data_entry(Pts):
# Set up variables
    bucket = "poc_bucket2" # DrJ test bucket
    org = "poc_org"
    influxdb_cloud_token = os.environ['INFLUX_AUTH_TOKEN']

# Store the URL of your InfluxDB instance
    url_local ="https://drjohnstechtalk.com:8086/"

# Initialize client
    client = influxdb_client.InfluxDBClient(url=url_local,token=influxdb_cloud_token,org=org)

# Write data
    write_api = client.write_api(write_options=SYNCHRONOUS)

    pts = []
    for Pt in Pts:
        measurement = Pt['measurement']
        tag_host_l = Pt['tags']
# for now we've hard-coded the three tags and two field values
        hostname = tag_host_l[0]['value']
        itemid = tag_host_l[1]['value']
        lineType = tag_host_l[2]['value']
        item = tag_host_l[3]['value']
        value = Pt['fields'][0]
        percent = Pt['fields'][1]

        time = Pt['time']
# convert seconds since epoch into format required by influxdb
        pt_time = datetime.fromtimestamp(int(time), timezone.utc).isoformat('T', 'milliseconds')
# pull out the UTC hour
        ts=datetime.fromtimestamp(int(time)).astimezone(pytz.UTC)
        H = ts.strftime('%H')
        point = Point(measurement).tag("hostname",hostname).tag("itemid",itemid).tag("ltype",lineType).tag("item",item).tag("UTChour",H).field('value',value).field('percent',percent).time(pt_time)
        pts.append(point)
    write_api.write(bucket=bucket, org="poc_org", record=pts, write_precision=WritePrecision.S)

These scripts show how I accumulate a bunch of points and make an entry in influxdb once I have a bunch of them to make things go faster.

Complaints

I am not comfortable with the flux query documentation. Nor the Grafana documentation for that matter. They both give you a taste of what you need, without many details or examples. For instance it looks like there are multiple syntaxes available to you using flux. I just pragmatically developed what works.

Conclusion

I am content if no one reads this and it only serves as my own documentation. But perhaps it will help someone facing similar issues. Unfortunately one of the challenges is asking a good question to a search engine when you’re a newbie and haven’t mastered the concepts.

But without too too much effort I was able to master enough Grafana to create something that will probably be useful to our vendor management team. Grafana is fun and powerful. There is a slight lack of examples and the documentation is a wee bit sparse.

Leave a Reply

Your email address will not be published. Required fields are marked *