Categories
Cloud

ADO: Check pipeline runs

Intro

I have previously written about how to copy all Azure DevOps (ADO) logs to a linux server. In this post I share a script I wrote which does a quality check on all the most recent pipeline runs. If there are any issues, a message is sent to an MS Teams channel.

Let’s get into the details.

Preliminary details

I am using the api, needless to say. I cannot say I have mastered the api or even come close to understanding it. But I have leveraged the same api call I used previously, since I observed it returns a lot of interesting data.

conf_check_all.ini

This config file is written as json to make importing it a breeze. You set up optional check conditions for the various pipelines you will be running, or even whether or not to perform any checks on a given pipeline at all.

{
"organization":"drjohns4ServicesCoreSystems",
"project":"Connectivity",
"url_base":"https://dev.azure.com/",
"url_params":"&api-version=7.1-preview.7",
"test_flag":false,
"run_ct_min":2,
"queue_time_max":1800,
"pipelines":{
"comment":{"maximum_processing_time_in_seconds":"integer_value","minimum_processing_time_in_seconds":"integer_value","(optional) check_flag - to potentially disable the checks for this pipeline":"either true or false"},
"default":{"max_proc_time":1800,"min_proc_time":3,"check_flag":true},
"feed_influxdb":{"max_proc_time":180,"min_proc_time":3},
"PAN-Usage4Mgrs":{"max_proc_time":900,"min_proc_time":60,"check_flag":true},
"PAN-Usage4Mgrs-2":{"max_proc_time":900,"min_proc_time":60},
"speed-up-sampling":{"max_proc_time":900,"min_proc_time":"2","check_flag":false},
"Pipeline_check":{"max_proc_time":45,"min_proc_time":"2","check_flag":false},
"Discover new vEdges":{"max_proc_time":3600,"min_proc_time":3,"check_flag":true},
}
}

So you see at the bottom is a dictionary where the keys are the names of the pipelines I am running, plus a default entry.
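
To make the intended use concrete, here is a minimal sketch (my own illustration, not part of the checker itself) of how this config gets consumed: look up the pipeline by name and fall back to the default entry when there is no dedicated one.

import json

with open('conf_check_all.ini') as f:
    config_d = json.load(f)

pipes_d = config_d['pipelines']
name = 'feed_influxdb'                              # one of the pipeline names from the config above
this_pipe = pipes_d.get(name, pipes_d['default'])   # fall back to "default" if there is no dedicated entry
print(this_pipe['max_proc_time'], this_pipe['min_proc_time'])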

check_all_pipelines.py

#!/usr/bin/python3
# fetch raw log to local machine
# for relevant api section, see:
#https://learn.microsoft.com/en-us/rest/api/azure/devops/build/builds/get-build-log?view=azure-devops-rest-7.1
import urllib.request,json,sys,os
from datetime import datetime,timedelta
from modules import aux_modules

conf_file = sys.argv[1]

# pipeline uses UTC so we must follow suit or we will miss files
#a_day_ago = (datetime.utcnow() - timedelta(days = 1)).strftime('%Y-%m-%dT%H:%M:%SZ')
startup_delay = 30 # rough time in seconds before the pipeline even begins to execute our script
an_hour_ago = (datetime.utcnow() - timedelta(hours = 1, seconds = startup_delay)).strftime('%Y-%m-%dT%H:%M:%SZ')
print('An hour ago was (UTC)',an_hour_ago)
format = '%Y-%m-%dT%H:%M:%SZ'

#url = 'https://dev.azure.com/drjohns4ServicesCoreSystems/Connectivity/_apis/build/builds?minTime=2022-10-11T13:00:00Z&api-version=7.1-preview.7'

# dump config file into a dict
config_d = aux_modules.parse_config(conf_file)
test_flag = config_d['test_flag']
if test_flag:
    print('config_d',config_d)
    print('We are in a testing mode because test_flag is:',test_flag)

url_base = f"{config_d['url_base']}{config_d['organization']}/{config_d['project']}/_apis/build/builds"
url = f"{url_base}?minTime={an_hour_ago}{config_d['url_params']}"
#print('url',url)
req = urllib.request.Request(url)
req.add_header('Authorization', 'Basic ' + os.environ['ADO_AUTH'])

# Get buildIds for pipeline runs from last 1 hour
with urllib.request.urlopen(req) as response:
   html = response.read()
txt_d = json.loads(html)
#{"count":215,"value":[{"id":xxx, "buildNumber":"20230203.107","status":"completed","result":"succeeded","queueTime":"2023-02-03T21:12:01.0865046Z","startTime":"2023-02-03T21:12:05.2177605Z","finishTime":"2023-02-03T21:17:28.1523128Z","definition":{"name":"PAN-Usage4Mgrs-2"
value_l = txt_d['value']
all_msgs = ''
header_msg = '**Recent pipeline issues**\n'
# check for too few pipeline runs
if len(value_l) <= config_d['run_ct_min']:
    all_msgs = f"There have been fewer than expected pipeline runs this past hour. Greater than **{config_d['run_ct_min']}** runs are expected, but there have been only **{len(value_l)}** runs.  \nSomething may be wrong.  \n"

for builds in value_l:
    msg = aux_modules.check_this_build(builds,config_d,url_base)
    if msg: all_msgs = f"{all_msgs}  \n{msg}  \n"

if all_msgs:
    if not test_flag: aux_modules.sendMessageToTeams(header_msg + all_msgs) # send to WebHook if not in a testing mode
    print(header_msg + all_msgs)
else:
    print('No recent pipeline errors')

Short explanation

I consider the code to be mostly self-explanatory. A cool thing I’m trying out here is the f-string format specifier, which lets you write to a string kind of like sprintf. I run this script every hour from, yes, an ADO pipeline! But since this job looks for errors, including errors which indicate a systemic problem with the agent pool, I run it from a different agent pool.
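
In case f-strings are new to you, here is a tiny illustration of the idea, unrelated to ADO itself (the values are made up):

name, duration = 'PAN-Usage4Mgrs', 947
msg = f"ADO Pipeline **{name}** took **{duration}** seconds"
print(msg)   # ADO Pipeline **PAN-Usage4Mgrs** took **947** seconds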

aux_modules.py

import json,re
import os,urllib.request
from datetime import datetime,timedelta
import pymsteams

def parse_config(conf_file):
# config file should be a json file
    f = open(conf_file)
    config_d = json.load(f)
    f.close()
    return config_d

def get_this_log(config_d,name,buildId,build_number):
# leaving out the api-version etc works better
#GET https://dev.azure.com/{organization}/{project}/_apis/build/builds/{buildId}/logs/{logId}?api-version=7.1-preview.2
#https://dev.azure.com/drjohns4ServicesCoreSystems/d6338e-f5b4-45-6c-7b3a86/_apis/build/builds/44071/logs/7'
        buildId_s = str(buildId)
        log_name = config_d['log_dir'] + "/" + name + "-" + build_number
# check if we already got this one
        if os.path.exists(log_name):
            return
        #url = url_base + organization + '/' + project + '/_apis/build/builds/' + buildId_s + '/logs/' + logId + '?' + url_params
        url = config_d['url_base'] + config_d['organization'] + '/' + config_d['project'] + '/_apis/build/builds/' + buildId_s + '/logs/' + config_d['logId']
        print('url for this log',url)
        req = urllib.request.Request(url)
        req.add_header('Authorization', 'Basic ' + config_d['auth'])
        with urllib.request.urlopen(req) as response:
            html = response.read()
        #print('log',html)
        print("Getting (name,build_number,buildId,logId) ",name,build_number,buildId_s,config_d['logId'])
        f = open(log_name,"wb")
        f.write(html)
        f.close()

def check_this_build(builds,config_d,url_base):
    format = '%Y-%m-%dT%H:%M:%SZ'
    buildId = builds['id']
    build_number = builds['buildNumber']
    status = builds['status'] # normally: completed
    result = builds['result'] # normally: succeeded
    queueTime = builds['queueTime']
    startTime = builds['startTime']
    finishTime = builds['finishTime']
    build_def = builds['definition']
    name = build_def['name']
    print('name,build_number,id',name,build_number,buildId)
    print('status,result,queueTime,startTime,finishTime',status,result,queueTime,startTime,finishTime)
    qTime = re.sub(r'\.\d+','',queueTime)
    fTime = re.sub(r'\.\d+','',finishTime)
    sTime = re.sub(r'\.\d+','',startTime)
    qt_o = datetime.strptime(qTime, format)
    ft_o = datetime.strptime(fTime, format)
    st_o = datetime.strptime(sTime, format)
    duration_o = ft_o - st_o
    duration = int(duration_o.total_seconds())
    print('duration',duration)
    queued_time_o = st_o - qt_o
    queued_time = int(queued_time_o.total_seconds())
    queue_time_max = config_d['queue_time_max']
# and from the config file we have...
    pipes_d = config_d['pipelines']
    this_pipe = pipes_d['default']
    if name in pipes_d: this_pipe = pipes_d[name]
    msg = ''
    if 'check_flag' in this_pipe:
        if not this_pipe['check_flag']:
            print('Checking for this pipeline has been disabled: ',name)
            return msg # skip this build if in test mode or whatever
    print('duration,min_proc_time,max_proc_time',duration,this_pipe['min_proc_time'],this_pipe['max_proc_time'])
    print('queued_time,queue_time_max',queued_time,queue_time_max)
    if duration > this_pipe['max_proc_time'] or duration < this_pipe['min_proc_time']:
        msg = f"ADO Pipeline **{name}** run is outside of expected time range. Build number: **{build_number}**. \n  Duration, max_proc_time, min_proc_time: **{duration},{this_pipe['max_proc_time']},{this_pipe['min_proc_time']}**"
    if not status == 'completed' or not result == 'succeeded':
        msg = f"ADO Pipeline **{name}** run has unexpected status or result. Build number: **{build_number}**. \n  - Status: **{status}** \n  - Result: **{result}**"
    if queued_time > queue_time_max: # Check if this job was queued for too long
        msg = f"ADO Pipeline **{name}** build number **{build_number}** was queued too long. Queued time was **{queued_time}** seconds"
    if msg:
# get the logs meta info to see which log is the largest
        url = f"{url_base}/{buildId}/logs"
        req = urllib.request.Request(url)
        req.add_header('Authorization', 'Basic ' + os.environ['ADO_AUTH'])
# Get the list of log files for this build
        with urllib.request.urlopen(req) as response:
            html = response.read()
        txt_d = json.loads(html)
        value_l = txt_d['value']
#{"count":11,"value":[{"lineCount":31,"createdOn":"2023-02-13T19:03:17.577Z","lastChangedOn":"2023-02-13T19:03:17.697Z","id":1...
        l_ct_max = 0
        log_id_err = 0
        log_id_all = 0   # default in case no log qualifies below
# determine log with either an error or the most lines - it differs for different pipeline jobs
        for logs_d in value_l[4:]:      # only consider the later logs
            url = f"{url_base}/{buildId}/logs/{logs_d['id']}"
            req = urllib.request.Request(url)
            req.add_header('Authorization', 'Basic ' + os.environ['ADO_AUTH'])
            with urllib.request.urlopen(req) as response:
                html = response.read().decode('utf-8')
            if re.search('error',html):
                log_id_err = logs_d['id']
                print('We matched the word error in log id',log_id_err)
            l_ct = logs_d['lineCount']
            if l_ct > l_ct_max:
                l_ct_max = l_ct
                log_id_all = logs_d['id']
        if log_id_err > 0 and not log_id_all == log_id_err: # the log containing an error wins over the longest log when they differ
            log_id_all = log_id_err
        url_all_logs = f"{url_base}/{buildId}/logs/{log_id_all}"
        msg = f"{msg}  \n**[Go to Log]({url_all_logs})**  "
    print(msg)
    return msg

def sendMessageToTeams(msg: str):
    """
    Send a message to a Teams Channel using webhook
    """
# my Pipeline_check webhook
    webHookUrl = "https://drjohns.webhook.office.com/webhookb2/66f741-9b1e-401c-a8d3-9448d352db@ec386b-c8f-4c0-a01-740cb5ba55/IncomingWebhook/2c8e881d05caba4f484c92617/7909f-d2f-b1d-3c-4d82a54"
    try:
        # escaping underscores to avoid alerts in italics.
        msg = msg.replace('_', '\\_')
        teams_msg = pymsteams.connectorcard(webHookUrl)
        teams_msg.text(f'{msg}')
        teams_msg.send()

    except Exception as e:
        print(f'failed to send alert: {str(e)}')

aux_modules.py contains most of the logic: checking each pipeline run against the criteria and constructing an alert in Markdown to send to MS Teams. I’m not saying it’s beautiful code. I’m still learning. But I am saying it works.

I’ve revised the code to find the log file which is most likely to contain the “interesting” stuff. That’s usually the longest one excluding the first five or so. There are often about 10 logs available for even a minimal pipeline run. So this extra effort helps.

Then I further revised the code to fetch the logs themselves and look for the word “error.” That word may show up in the longest log or it may not. If it shows up in a different log, the log containing the error takes precedence as the most interesting log.

check_all_pipelines.yml

# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

##trigger:
##- main

trigger: none

pool:
  name: dsc-adosonar-drjohns4ServicesCoreSystems-agent
#  name: visibility_agents

#strategy:
#  matrix:
#    Python36:
#      python.version: '3.6'

steps:
#- task: UsePythonVersion@0
#  inputs:
#    versionSpec: '$(python.version)'
#  displayName: 'Use Python $(python.version)'

- script: pip3 install -vvv --timeout 60 -r Pipeline_check/requirements.txt
  displayName: 'Install requirements'

- script: python3 check_all_pipelines.py conf_check_all.ini
  displayName: 'Run script'
  workingDirectory: $(System.DefaultWorkingDirectory)/Pipeline_check
  env:
    ADO_AUTH: $(ado_auth)
    PYTHONPATH: $(System.DefaultWorkingDirectory)/Pipeline_check:$(System.DefaultWorkingDirectory)
schedules:
- cron: "19 * * * *"
  displayName: Run the script at 19 minutes after the hour
  branches:
    include:
    - main
  always: true

We sort of drag this yaml file around from pipeline to pipeline, so some of it may not appear too optimized for this particular pipeline. But it does the job without fuss.

Conclusion

My ADO pipeline checker conveniently shows us all our pipeline runs which have failed for various reasons – takes too long, completed with errors, too few runs in the last hour, … It sends its output to an MS Teams channel we have subscribed to, by way of a webhook we set up. So far it’s working great!

References and related

Here’s my post on fetching the script log resulting from the pipeline run.

Categories
Visibility

Everything I need to know about Influxdb and Grafana

Intro

Some of my colleagues had used Influxdb and Grafana at their previous job so they thought it might fit for what we’re doing in the Visibility team. It sounded good in theory, anyway, so I had to agree. There were a lot of pitfalls. Eventually I got it to the point where I’m satisfied with my accomplishments and want to document the hurdles I’ve overcome.

So as time permits I will be fleshing this out.

Grafana

I’m going to lead with the picture and then the explanation makes a lot more sense.

Single vedge heat map

I’ve spent the bulk of my time wrestling with Grafana. Actually it looks like a pretty capable tool. It’s mostly just understanding how to make it do what you are dreaming about. Our installed version currently is 9.2.1.

My goal is to make a heatmap. But a special kind, similar to what I saw the network provider has. That would namely entail one vedge per row and one column per hour, hence 24 columns in total. A vedge is a kind of SD-WAN router. I want to help the networking group look at hundreds of them at a time. So that’s one potential dashboard; it would give a view of a day. Another dashboard would show just one router, with each row representing a day and the columns again showing an hour. Also a heatmap. The multi-vedge dashboard should link to the individual dashboard, ideally. In the end I pulled it off. I am also responsible for feeding the raw data into Influxdb and hence also for the table design.

Getting a workable table design was really important. I tried to design it in a vacuum, but that only partially worked. So I revised, adding tags and fields as I felt I needed to, while being mindful of not blowing up the cardinality. I am now using these two tables, sorry, measurements.

vedge measurement
vedge_stats measurement

Although there are hundreds of vedges, some of my tags are redundant, so don’t get overly worried about my high cardinality. UTChour is, yes, a total kludge – not the “right” way to do things. But I’m still learning and it was simpler in my mind. item in the first measurement is redundant with itemid, but it is more user-friendly: a human-readable name.
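
As an aside, the human-readable item value is built by mashing together the hostname, line type and a shortened interface name. This is lifted from the feed script shown later in this post, with made-up sample values:

import re

hostname, lineType = 'EU-vedge-01', 'MPLS'              # sample values, not real ones
itemname = 'GigabitEthernet1 interface out traffic'     # hypothetical Zabbix item name
iface_dir = re.sub(r'(\S+) interface (\S+) .+', r'\1_\2', itemname)
print(hostname + '_' + lineType + '_' + iface_dir)      # EU-vedge-01_MPLS_GigabitEthernet1_out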

Influx Query Explorer

It’s very helpful to use the Explorer, but the syntax there is not exactly the same as it will be when you define template variables. Go figure.

Multiple vedges for the last day

So how did I do it in the end?

Mastering template variables is really key here. I have a drop-down selection for region. In Grafana-world it is a custom variable with potential values EU,NA,SA,AP. That’s pretty easy. I also have a threshold variable, with possible values: 0,20,40,60,80,90,95. And a math variable with values n95,avg,max. More recently I’ve added a threshold_max and a math_days variable.

It gets more interesting however, I promise. I have a category variable which is of type query:

from(bucket: "poc_bucket2")
|> range (start: -1d)
|> filter(fn:(r) => r._measurement == "vedge_stats")
|> group()
|> distinct(column: "category")

Multi-value and Include all options are checked. Just to make it meaningful, category is assigned by the WAN provider and has values such as Gold, Silver, Bronze.

And it gets still more interesting because the last variable depends on the earlier ones, hence we are using chained variables. The last variable, item, is defined thusly:

from(bucket: "poc_bucket2")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stats" and r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.category, set: ${category:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold} and r._value <= ${threshold_max})
|> group()
|> distinct(column: "item")

So what it is designed to do is to generate a list of all the items, which in reality are particular interfaces of the vedges, into a drop-down list.

Note that I want the user to be able to select multiple categories. It’s not well-documented how to chain such a variable, so note the use of contains and set in that one filter function.

And note the double-quotes around ${Region}, another chained variable. You need those double-quotes! It kind of threw me because in Explorer I believe you may not need them.

And all that would be of limited use if we didn’t somehow incorporate these template variables into our panels. I use the Stat visualization, so you get one stat per series. That’s why I artificially created the tag UTChour: so I could easily get a unique stat box for each hour.
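
For reference, UTChour is just the two-digit UTC hour pulled from the data point’s timestamp when it is written (the real code is in influx_modules.py, further down in this post):

from datetime import datetime, timezone

ts = datetime.fromtimestamp(1671036337, timezone.utc)   # sample epoch timestamp from a Zabbix data point
print(ts.strftime('%H'))                                # '16' -- each distinct UTChour value gets its own stat box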

The stat visualization flux Query

Here it is…

data = from(bucket: "poc_bucket2")
  |> range(start: -24h, stop: now())
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and r.hostname =~ /^${Region}/ and r.item == "${item}"
  )
  |> drop(columns: ["itemid","ltype","hostname"])
data

Note I have dropped my extra tags and such which I do not wish to appear during a mouseover.

Remember our regions can be one of AP, EU, NA or SA? Well, the hostnames assigned to each vedge start with the two letters of its region. Hence the regular expression matching works there to restrict consideration to just the vedges in the selected region.

We are almost done.

Making it a heat map

My measurement has a field called percent, which is the percent of available bandwidth being used, so I created color-based thresholds:

Colorful percent-based thresholds

You can imagine how colorful the dashboard gets as you ratchet up the threshold template variable. So the use of these thresholds is what turns our stat squares into a true heatmap.

Heatmap visualization

I found the actual heatmap visualization useless for my purposes, by the way!

There is also an unsupported heatmap plugin for Grafana which simply doesn’t work. Hence my roll-your-own approach.

Repetition

How do we get a panel row per vedge? The stat visualization has a feature called Repeat Options. So you repeat by variable. The variable selected is item. Remember that item came from our very last template variable. Repeat direction is Vertical.

For calculation I choose mean. Layout orientation is Vertical.

The visualization title is also variable-driven. It is ${item}.

The panels are long and thin. Like maybe two units high? – one unit for the label (the item) and the one below it for the 24 horizontal stat boxes.

Put it all together and voila, it works and it’s cool and interactive and fast!

Single vedge heatmap data over multiple days

Of course this is very similar to the multiple vedge dashboard. But now we’re drilling down into a single vedge to look at its usage over a period of time, such as the last two weeks.

Variables

As before we have a threshold, Region and category variable, with category derived from the same flux query shown above. A new variable is day, which is custom and hidden. It has values 1,2,3,4,…,14. I don’t know how to do a loop in flux or I might have opted for a more elegant method to specify the last 14 days.
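
Generating that list of values is at least easy to script; a throwaway Python one-liner like the following (my own helper, not part of any pipeline) produces the comma-separated string to paste into the custom variable definition.

print(','.join(str(d) for d in range(1, 15)))   # prints 1,2,3,...,14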

I did the item variable query a little differently, but I think it’s mostly an alternative and could have been the same:

from(bucket: "poc_bucket2")
|> range (start: -${math_days}d)
|> filter(fn:(r) => r._measurement == "vedge_stats" and r.region == "${Region}")
|> filter(fn:(r) => contains(value: r.category, set: ${category:json}))
|> filter(fn:(r) => r._field == "${math}" and r._value >= ${threshold} and r._value <= ${threshold_max})
|> group()
|> distinct(column: "item")

Notice the slightly different handling of Region. And those double-quotes are important, as I learned from the school of hard knocks!

The flux query in the panel is of course different. It looks like this:

import "date"
b = date.add(d: 1d, to: -${day}d)
data = from(bucket: "poc_bucket2")
  |> range(start: -${day}d, stop: b)
  |> filter(fn: (r) =>
    r._measurement == "vedge" and
    r._field == "percent" and
    r.item == "$item"
  )
  |> drop(columns:["itemid","ltype","hostname"])
data

So we’re doing some date arithmetic so we can get panel strips, one per day. These panels are long and thin, same as before, but I omitted the title since it’s all the same vedge.

The repeat options are repeat by variable day, repeat direction Vertical as in the other dashboard. The visualization is Stat, as in the other dashboard.

And that’s about it! Here the idea is that you play with the independent variables such as Region and threshold; that generates a list of matching vedge interfaces, and you pick one from the drop-down list.

Linking the multiple vedge dashboard to the single vedge history dashboard

Of course the more interactive you make these things the cooler it becomes, right? I was excited to be able to link these two dashboards together in a sensible way.

In the panel config you have Data links. I found this link works:

https://drjohns.com:3000/d/1MqpjD24k/single-vedge-usage-history?orgId=1&var-threshold=60&var-math=n95&${item:queryparam}

So, to generalize, since most of the URL is specific to my implementation: both dashboards utilize the item variable. I basically discovered the URL for the single vedge dashboard, dissected it and parameterized the item, getting the syntax right with a little Internet research.

So the net effect is that when you hover over any of the vedge panels in the multi-vedge dashboard, you can click on that vedge and pull up – in a new tab in my case – the individual vedge usage history. It’s pretty awesome.

Influxdb

Influxdb is a time series database. It takes some getting used to. Here is my cheat sheet which I like to refer to.

  • bucket is a named location with a retention policy where time-series data is stored.
  • series is a logical grouping of data defined by shared measurement, tag and field.
  • measurement is similar to an SQL database table.
  • tag is similar to indexed columns in an SQL database.
  • field is similar to unindexed columns in an SQL database.
  • point is similar to SQL row.
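
To tie that vocabulary together, here is a minimal write example using the same influxdb_client library my scripts below use; the token and the tag values are placeholders rather than anything from my production setup.

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="https://drjohnstechtalk.com:8086/", token="MY_TOKEN", org="poc_org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# one point = one "row": measurement + tags (indexed) + fields (unindexed) + a timestamp
p = (Point("vedge")
     .tag("hostname", "EU-vedge-01")
     .tag("ltype", "MPLS")
     .field("percent", 42)
     .time(1671036337, WritePrecision.S))
write_api.write(bucket="poc_bucket2", org="poc_org", record=p, write_precision=WritePrecision.S)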

This is not going to make a lot of sense to anyone who isn’t Dr John. But I’m sure I’ll be referring to this section for a How I did this reminder.

OK. So I wrote a feed_influxdb.py script which runs every 12 minutes in an Azure DevOps pipeline. It extracts the relevant vedge data from Zabbix using the Zabbix api and puts it into my influxdb measurement vedge, whose definition I have shown above. I would say the code is fairly generic, except that it relies on the existence of a master file which contains all the relevant static data about the vedges, such as their interface names, Zabbix itemids, and their maximum bandwidth (we called it zabbixSpeed). You could pretty much deduce the format of this master file by reverse-engineering this script. So anyway, here is feed_influxdb.py.
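
Before the code, here is roughly what one entry of that inventory file looks like, as deduced from how the script walks it (the key names come from the script; the values here are made up):

inventory_d = {
    "SITE1234": {                                   # an SSID
        "hostname": {
            "EU-vedge-01": [                        # one vedge and its list of interfaces/items
                {"itemname": "GigabitEthernet1 interface out traffic",
                 "lineType": "MPLS",
                 "itemid": "682837",
                 "zabbixSpeed": "100000000"}
            ]
        }
    }
}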


from pyzabbix import ZabbixAPI
import requests, json, sys, os, re
import time,datetime
from time import sleep
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
from modules import aux_modules,influx_modules

# we need to get data out of Zabbix
inventory_file = 'prod.config.visibility_dashboard_reporting.json'
#inventory_file = 'inv-w-bw.json' # this is a modified version of the above and includes Zabbix bandwidth for most ports
# Login Zabbix API - use hidden variable to this pipeline
token_zabbix = os.environ['ZABBIX_AUTH_TOKEN']
url_zabbix = 'https://zabbix.basf.net/'
zapi = ZabbixAPI(url_zabbix)
zapi.login(api_token=token_zabbix)
# Load inventory file
with open(inventory_file) as inventory_file:
    inventory_json = json.load(inventory_file)
# Time range which want to get data (unixtimestamp)
inventory_s = json.dumps(inventory_json)
inventory_d = json.loads(inventory_s)
time_till = int(time.mktime(datetime.datetime.now().timetuple()))
time_from = int(time_till - 780)  # about 12 minutes plus an extra minute to reflect start delay, etc
i=0
max_items = 200
item_l = []
itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname = {},{},{},{}
for SSID in inventory_d:
    print('SSID',SSID)
    hostname_d = inventory_d[SSID]['hostname']
    for vedge_s in hostname_d:
        print('vedge_s',vedge_s,flush=True)
        items_l = hostname_d[vedge_s]
        for item_d in items_l:
            print('item_d',item_d,flush=True)
            itemname = item_d['itemname']
            lineType = item_d['lineType']
            if 'zabbixSpeed' in item_d:
                bandwidth = int(item_d['zabbixSpeed'])
            else:
                bandwidth = 0
            itemid = item_d['itemid']
            if lineType == 'MPLS' or lineType == 'Internet':
                i = i + 1
                itemid_to_vedge[itemid] = vedge_s # we need this b.c. Zabbix only returns itemid
                itemid_to_ltype[itemid] = lineType # This info is nice to see
                itemid_to_bw[itemid] = bandwidth # So we can get percentage used
                itemid_to_itemname[itemid] = itemname # So we can get percentage used
                item_l.append(itemid)
                if i > max_items:
                    print('item_l',item_l,flush=True)
                    params = {'itemids':item_l,'time_from':time_from,'time_till':time_till,'history':0,'limit':500000}
                    print('params',params)
                    res_d = zapi.do_request('history.get',params)
                    #print('res_d',res_d)
                    #exit()
                    print('After call to zapi.do_request')
                    result_l = res_d['result']
                    Pts = aux_modules.zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname)
                    for Pt in Pts:
                        print('Pt',Pt,flush=True)
                        # DEBUGGING!!! Normally call to data_entry is outside of this loop!!
                        #influx_modules.data_entry([Pt])
                    influx_modules.data_entry(Pts)
                    item_l = [] # empty out item list
                    i = 0
                    sleep(0.2)
else:
# we have to deal with leftovers which did not fill the max_items
    if i > 0:
        print('Remainder section')
        print('item_l',item_l,flush=True)
        params = {'itemids':item_l,'time_from':time_from,'time_till':time_till,'history':0,'limit':500000}
        res_d = zapi.do_request('history.get',params)
        print('After call to zapi.do_request')
        result_l = res_d['result']
        Pts = aux_modules.zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname)
        for Pt in Pts:
            # DEBUGGING!!! normally data_entry is called after this loop
            print('Pt',Pt,flush=True)
            #influx_modules.data_entry([Pt])
        influx_modules.data_entry(Pts)
print('All done feeding influxdb!')

I’m not saying it’s great code. I’m only saying that it gets the job done. It relies on a couple auxiliary scripts. Here is aux_modules.py (it may include some packages I need later on).


import re
import time as tm
import numpy as np

def zabbix_to_pts(result_l,itemid_to_vedge,itemid_to_ltype,itemid_to_bw,itemid_to_itemname):

# turn Zabbix results into a list of points which can be fed into influxdb
# [{'itemid': '682837', 'clock': '1671036337', 'value': '8.298851463718859E+005', 'ns': '199631779'},

    Pts = []
    for datapt_d in result_l:
        itemid = datapt_d['itemid']
        time = datapt_d['clock']
        value_s = datapt_d['value']
        value = float(value_s) # we are getting a floating point represented as a string. Convert back to float
        hostname = itemid_to_vedge[itemid]
        lineType = itemid_to_ltype[itemid]
        itemname = itemid_to_itemname[itemid]
# item is a hybrid tag, like a primary tag key
        iface_dir = re.sub(r'(\S+) interface (\S+) .+',r'\1_\2',itemname)
        item = hostname + '_' + lineType + '_' + iface_dir
        if itemid in itemid_to_bw:
            bw_s = itemid_to_bw[itemid]
            bw = int(bw_s)
            if bw == 0:
                percent = 0
            else:
                percent = int(100*value/bw)
        else:
            percent = 0
        tags = [{'tag':'hostname','value':hostname},{'tag':'itemid','value':itemid},{'tag':'ltype','value':lineType},{'tag':'item','value':item}]
        fields = [value,percent]
        Pt = {'measurement':'vedge','tags':tags,'fields':fields,'time':time}
        Pts.append(Pt)
    return Pts

Next I’ll show influx_modules.py.


import influxdb_client, os, time
from datetime import datetime, timezone
import pytz
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS
import random

def data_entry(Pts):
# Set up variables
    bucket = "poc_bucket2" # DrJ test bucket
    org = "poc_org"
    influxdb_cloud_token = os.environ['INFLUX_AUTH_TOKEN']

# Store the URL of your InfluxDB instance
    url_local ="https://drjohnstechtalk.com:8086/"

# Initialize client
    client = influxdb_client.InfluxDBClient(url=url_local,token=influxdb_cloud_token,org=org)

# Write data
    write_api = client.write_api(write_options=SYNCHRONOUS)

    pts = []
    for Pt in Pts:
        measurement = Pt['measurement']
        tag_host_l = Pt['tags']
# for now we've hard-coded the three tags and two field values
        hostname = tag_host_l[0]['value']
        itemid = tag_host_l[1]['value']
        lineType = tag_host_l[2]['value']
        item = tag_host_l[3]['value']
        value = Pt['fields'][0]
        percent = Pt['fields'][1]

        time = Pt['time']
# convert seconds since epoch into format required by influxdb
        pt_time = datetime.fromtimestamp(int(time), timezone.utc).isoformat('T', 'milliseconds')
# pull out the UTC hour
        ts=datetime.fromtimestamp(int(time)).astimezone(pytz.UTC)
        H = ts.strftime('%H')
        point = Point(measurement).tag("hostname",hostname).tag("itemid",itemid).tag("ltype",lineType).tag("item",item).tag("UTChour",H).field('value',value).field('percent',percent).time(pt_time)
        pts.append(point)
    write_api.write(bucket=bucket, org="poc_org", record=pts, write_precision=WritePrecision.S)

These scripts show how I accumulate a bunch of points and then write them to influxdb in one batch to make things go faster.

Complaints

I am not comfortable with the flux query documentation. Nor the Grafana documentation for that matter. They both give you a taste of what you need, without many details or examples. For instance it looks like there are multiple syntaxes available to you using flux. I just pragmatically developed what works.

Conclusion

I am content if no one reads this and it only serves as my own documentation. But perhaps it will help someone facing similar issues. Unfortunately one of the challenges is asking a good question to a search engine when you’re a newbie and haven’t mastered the concepts.

But without too too much effort I was able to master enough Grafana to create something that will probably be useful to our vendor management team. Grafana is fun and powerful. There is a slight lack of examples and the documentation is a wee bit sparse.

Categories
Python

Azure DevOps: use the api to copy logs to linux

Intro

As far as I can tell there’s no way to search through multiple pipeline logs with a single command. In linux it’s trivial. Seeing the lack of this basic functionality I decided to copy all my pipeline logs over to a linux server using the Azure DevOps (ADO) api.

The details

This is the main program which I’ve called get_raw_logs.py.


#!/usr/bin/python3
# fetch raw log to local machine
# for relevant api section, see:
#https://learn.microsoft.com/en-us/rest/api/azure/devops/build/builds/get-build-log?view=azure-devops-rest-7.1
import urllib.request,json,sys
from datetime import datetime,timedelta
from modules import aux_modules

conf_file = sys.argv[1]

# pipeline uses UTC so we must follow suit or we will miss files
a_day_ago = (datetime.utcnow() - timedelta(days = 1)).strftime('%Y-%m-%dT%H:%M:%SZ')
print('a day ago (UTC)',a_day_ago)

#url = 'https://dev.azure.com/drjohns4ServicesCoreSystems/Connectivity/_apis/build/builds?minTime=2022-10-11T13:00:00Z&api-version=7.1-preview.7'

# dump config file into a dict
config_d = aux_modules.parse_config(conf_file)

url = config_d['url_base'] + config_d['organization'] + '/' + config_d['project'] + '/_apis/build/builds?minTime=' + a_day_ago + config_d['url_params']
#print('url',url)
req = urllib.request.Request(url)
req.add_header('Authorization', 'Basic ' + config_d['auth'])

# Get buildIds for pipeline runs from last 24 hours
with urllib.request.urlopen(req) as response:
   html = response.read()
txt_d = json.loads(html)
#{"count":215,"value":[{"id":xxx,"buildNumber":"20221011.106","definition":{"name":"PAN-Usage4Mgrs-2"
value_l = txt_d['value']
for builds in value_l:
    buildId = builds['id']
    build_number = builds['buildNumber']
    build_def = builds['definition']
    name = build_def['name']
    #print('name,build_number,id',name,build_number,buildId)
    #print('this_build',builds)
    if name == config_d['pipeline1'] or name == config_d['pipeline2']:
        aux_modules.get_this_log(config_d,name,buildId,build_number)

In the modules directory this is aux_modules.py.


import json
import os,urllib.request

def parse_config(conf_file):
# config file should be a json file
    f = open(conf_file)
    config_d = json.load(f)
    f.close()
    return config_d

def get_this_log(config_d,name,buildId,build_number):
# leaving out the api-version etc works better
#GET https://dev.azure.com/{organization}/{project}/_apis/build/builds/{buildId}/logs/{logId}?api-version=7.1-preview.2
#https://dev.azure.com/drjohns4ServicesCoreSystems/d6335c8e-f5b4-44a5-8f6c-7b17fe663a86/_apis/build/builds/44071/logs/7'
        buildId_s = str(buildId)
        log_name = config_d['log_dir'] + "/" + name + "-" + build_number
# check if we already got this one
        if os.path.exists(log_name):
            return
        #url = url_base + organization + '/' + project + '/_apis/build/builds/' + buildId_s + '/logs/' + logId + '?' + url_params
        url = config_d['url_base'] + config_d['organization'] + '/' + config_d['project'] + '/_apis/build/builds/' + buildId_s + '/logs/' + config_d['logId']
        print('url for this log',url)
        req = urllib.request.Request(url)
        req.add_header('Authorization', 'Basic ' + config_d['auth'])
        with urllib.request.urlopen(req) as response:
            html = response.read()
        #print('log',html)
        print("Getting (name,build_number,buildId,logId) ",name,build_number,buildId_s,config_d['logId'])
        f = open(log_name,"wb")
        f.write(html)
        f.close()

Unlike programs I usually write, some of the key logic resides in the config file. My config file looks something like this.

{
"organization":"drjohns4ServicesCoreSystems",
"project":"Connectivity",
"pipeline1":"PAN-Usage4Mgrs",
"pipeline2":"PAN-Usage4Mgrs-2",
"logId":"7",
"auth":"Yaskaslkasjklaskldslkjsasddenxisv=",
"url_base":"https://dev.azure.com/",
"url_params":"&api-version=7.1-preview.7",
"log_dir":"/var/tmp/rawlogs"
}

It runs very efficiently so I run it every three minutes.

In my pipelines, all the interesting stuff is in logId 7 so I’ve hardcoded that. It could have turned out differently. Notice I am getting the logs from two pipelines, due to the limitation, discussed previously, that you can only do 1000 pipeline runs a week. I was forced to run two identical pipelines, staggered, every 12 minutes, with pipeline-2 sleeping the first six minutes.

The auth is the base-64 encoded text for any:<my_auth_token>.
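
In case it helps, here is one way to produce that string (the token value is obviously a placeholder):

import base64

my_auth_token = 'xxxxxxxxxxxx'    # your ADO personal access token (placeholder)
auth = base64.b64encode(f'any:{my_auth_token}'.encode()).decode()
print(auth)                       # paste this into the "auth" field of the config file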

Conclusion

I show how to copy the logs over from Azure DevOps pipeline runs to a local Unix system, where you can run all the normal cool linux commands on them.

References and related

Running an ADO pipeline more than 1000 times a week.

ADO Rest api reference section relevant for this post: https://learn.microsoft.com/en-us/rest/api/azure/devops/build/builds/get-build-log?view=azure-devops-rest-7.1

How to secure a sensitive variable in ADO.

Categories
Cloud

Azure DevOps: pipeline tips

Intro

I’ve made a lot of mistakes and learned something from every one. I am trying to pass on some of what I learned.

Limit of 1000 pipeline runs per week

This seems pretty crazy. In my unit we were going all gung-ho and imagining Azure DevOps pipelines as an elegant replacement for running cron jobs on our own linux servers. But something as simple as “I need to run this job every five minutes” seems to be overwhelming for a pipeline. What? Yes there is a hard limit of 1000 pipeline jobs a week. This limit is discussed here.

I wanted a job to run every six minutes, which would still hit that limit. So what I am trying is to create two pipelines. Each is scheduled to run every 12 minutes. The yaml files are almost the same, except that in one of them I sleep for six minutes. I also needed to remember to re-create the pipeline variables I was using.

Getting the raw logs from your pipeline runs

I wrote a little script to get the raw logs and copy them to a linux filesystem where I can use the linux command-line tools I know and love to examine them in bulk. The main point I wish to share right now is that it is not at all obvious that you need to use the get builds section of the api, not the log section! Who would have guessed? https://learn.microsoft.com/en-us/rest/api/azure/devops/build/builds/get-build-log?view=azure-devops-rest-7.1

Errors I am seeing in my pipeline

[error] We stopped hearing from agent dsc-adosonar-drjohns4servicescoresystems-agent-549c476959-whd72. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

We still need to figure this one out. The error appears only randomly.

I also saw a lot of more subtle errors which amounted to my variables not being defined correctly in the yaml section. Indentation is important! I had variables set up as secret environment variables, among other things. The resulting behavior does not always make it obvious what the root cause is.

Don’t run the pipeline for every commit

In your commit comment, put

[skip ci]

somewhere on its own line. This prevents the pipeline from running each time you do a commit, which quickly gets annoying.

References and related

How to use the Azure DevOps api to for instance fetch the raw logs

Microsoft’s api documentation pertinent to this topic:

https://learn.microsoft.com/en-us/rest/api/azure/devops/build/builds/get-build-log?view=azure-devops-rest-7.1

Categories
Cloud Python

Azure DevOps: How to work in a subfolder of a project

Intro

Our repo corresponds to a project. Within it are subfolders corresponding to individual apps we want to run in a pipeline.

Someone provided me with a starter yaml file to run my code in a pipeline. Originally my code was running just fine on my own linux server. In the pipeline, not so much: it became apparent it was expecting the current working directory to be the subfolder (directory in linux-speak) for references to modules, config files, etc. The documentation is kind of thin. So I spent hours checking things out and creating the solution which I now present.

The solution

The critical thing is to set the workingDirectory. Here is that section of the yaml file.

- script: python getlogdata.py 'PAN4-5min.aql'
  displayName: 'Run script'
  workingDirectory: $(System.DefaultWorkingDirectory)/PAN_Usage_4_Mgrs
  env:
    AUTH_TOKEN: $(auth_token)
#    PYTHONPATH: $(System.DefaultWorkingDirectory)/PAN_Usage_4_Mgrs/modules

Note that the PYTHONPATH environment variable is another possible way out if all you need is to include modules, but it won’t help you with other things like finding your config file.

Errors

Now suppose you see an error like I got:

ImportError: cannot import name 'ZabbixMetric' from 'jhpyzabbix' (unknown location).

I had tried to put the jhpyzabbix folder at the same level as my subfolder, so, right under the top project level. At first I was getting module not found errors, so I put back my PYTHONPATH like so:

    PYTHONPATH: $(System.DefaultWorkingDirectory)/PAN_Usage_4_Mgrs:$(System.DefaultWorkingDirectory)

And that’s when I got that cannot import name error. What caused it is that although I had copied over the needed .py files to jhpyzabbix, I forgot one whose purpose seemed irrelevant to me: __init__.py. Turns out that tiny python file is quite important after all. School of hard knocks… It sets up the namespace mapping, I guess. To be concrete, mine looks like this:

from .api import ZabbixAPI, ZabbixAPIException, ssl_context_compat
from .sender import ZabbixMetric, ZabbixSender, ZabbixResponse

References and related

Passing secure variable in Azure DevOps to your program

Categories
Cloud

Securing a sensitive variable in Azure DevOps

Intro

I’m a newbie at Azure DevOps. They sort of have to drag me kicking and screaming to give up the comfort of my personal linux server and endure the torture of learning something brand new. But I guess that’s the new way, so I gotta go along with it. I work within a project that has a repo. And we intend to run a pipeline. Question is, how to deal with sensitive parameters like passwords?

One approach

If you just do raw Internet searches you can fall into any number of rabbit holes. Key vault, anyone? How about relying on secure files within the library? I guess – very tentatively – that using a key vault might be the right way to deal with this issue, but it all depends on the ACLs available and I do not know what those are. I also do not see a simple-minded way to set up a key vault. So what the guy who seems to know a lot more than I do suggests is to set up a hidden variable for the pipeline itself.

The thing is that even that has its own gotchas. I find that depending on where you start from, you may or may not see the option to declare a pipeline variable as hidden.

If I am in the Pipeline section and looking at Recently Run Pipelines, and click on my pipeline, before I run it I can add variables. Doing it that way, you only get the option to include a name and a value. No option for declaring it to be hidden.

So how do you do it?

Instead of Run Pipeline > Variables, go to Edit Pipeline > Variables. Then variables you create will have the option to be kept secret.

It is a bad idea to pass sensitive information on the command line. Anyone with access to that agent could do a process listing to read the secret. So instead you use an environment variable. It’s a little tricky as you have to understand yaml variable interpolation versus script interpolation.

My secret variable is auth_token for the Zabbix api token. So in my yaml file I have a reference to it which sets up an environment variable:

- script: python Delete_Unused_VEdges/activity_check.py 'SA'
  displayName: 'Run script'
  env:
    AUTH_TOKEN: $(auth_token)

And in activity_check.py I read that environment variable like this:

token_zabbix = os.environ['AUTH_TOKEN']

And, voila, it’s all good.

But how secure is it really?

Question is, is this just performative security? Because I’m working with a project. Anyone else with access to the project could alter my python program temporarily to print out the value of the secret variable. I’m assuming they can both alter the repo I use as well as run a pipeline. Yes, they make it a tiny bit challenging because you need to break up a secret variable into two pieces in order to print it out, but come on. So maybe we’ve mostly achieved security by obscurity.

But again, I don’t really know the richness of this environment and perhaps more can be done to restrict who can alter the repo and run it in a pipeline.

Conclusion

People who know more than I, a newbie, have suggested a certain approach to dealing with sensitive variables in our Azure DevOps pipelines. I have described here what I as a newbie see as potential pitfalls, as well as how to implement their chosen method.