Monitoring by Zabbix: a working document

Intro
I panned Zabbix in this post: DIY monitoring. But I have compelling reasons to revisit it. I have to say it has matured, but there remain some very frustrating things about it, especially when compared with SiteScope (now owned by Microfocus) which is so much more intuitive.

But I am impressed by the breadth of the user base and the documentation. But learning how to do any specific thing is still an exercise in futility.

I am going to try to structure this post as a problems encountered, and how they were resolved.

Current production version as of this writing?
Answer: 4.4

Zabbix Manual does not work in Firefox
That’s right. I can’t even read the manual in my version of Firefox. Its sections do not expand. Solution: use Chrome

Which database?
You may see references to MYSQL in Zabbix docs. MYSQL is basically dead. what should you do?

Zabbix quick install on Redhat

Answer
Install mariadb which has replaced MYSQL and supports the same commands such as the mysql from the screenshot. On my Redhat instance I have installed these mariadb-related repositories:

mariadb-5.5.64-1.el7.x86_64
mariadb-server-5.5.64-1.el7.x86_64
mariadb-libs-5.5.64-1.el7.x86_64

Terminology confusion
what is a host, a host group, a template, an item, a web scenario, a trigger, a media type?
Answer

Don’t ask me. When I make progress I’ll post it here.

Web scenario specific issues
Can different web scenarios use different proxies?
Answer: Yes, no problem. In really old versions this was not possible. See web scenario screenshot below.

Can the proxy be a variable so that the same web scenario can be used for different proxies?
Answer: Yes. Let’s say you attach a web scenario to a host. In that host’s configuration you can define a “macro” which sets the variable value. e.g., the value of HTTP_PROXY in my example. I think you can do the same from a template, but I’m getting ahead of myself.

Similarly, can you do basic proxy auth and hide the credentials in a MACRO? Answer: I think so. I did it once at any rate. See above screenshot.

Why does my google.com web scenario work whereas my amazon.com scenario not when they’re exactly the same except for the URL?
Answer: some ideas, but the logging information is bad. Amazon does not take to bots hitting it for health check reasons. It may work better to change the agent type to Linux|Chrome, which is what I am trying now. Here’s my original answer: Even with command-line curl I get an error through this proxy. That can’t be good:
$ curl ‐vikL www.amazon.com

...
NSS error -5961 (PR_CONNECT_RESET_ERROR)
* TCP connection reset by peer
* Closing connection 1
curl: (35) TCP connection reset by peer


My amazon.com web scenario is not working (status of 1), yet in dashboard does not return any obvious warning or error or red color. Why? Answer:
no idea. Maybe you have to define a trigger?

Say you’re on the Monitoring|latest data screen. Does the data get auto-updated? Answer: yes, it seems to refresh every 30 seconds.

Why is the official FAQ so useless? Answer: no idea how a piece of software otherwise so feature-rich could have such a useless FAQ.

Zabbix costs nothing. Is it still actively supported? Answer: it seems very actively supported for some reason. Not sure what the revenue model is, however…

Can I force one or more web scenarios to be run immediately? I do this all the time in SiteScope. Answer: I guess not. There is no obvious way.

Suppose you have defined an item. what is the item key? Answer: no idea. But I’m seeing it’s very important…

What is the equivalent to SiteScope’s script monitors? Answer: Either ssh check or external check.

How would you set up a simple PING monitor, i.e., to see if your host is up? Answer: Create an item as a “simple check”, e.g., with the name ping this host, and the key icmpping[{HOST.IP},3]. That can go into a template, by the way. If it succeeded it will return a 1.

I’ve made an error in my script for an external check. Why does Latest data show nothing at all? Answer: no idea. If the error is bad enough Zabbix will disable the item on you, so it’s not really running any longer. But even when it doesn’t do that, a lot of times I simply see no output whatsoever. Very frustrating.

Why is the output from an ssh check truncated, where does the rest go? Answer: no idea.

How do you increase the information contained in the zabbix server log? Let’s say your zabbix server is running normally. Then run this command: zabbix_server ‐R log_level_increase
You can run it multiple times to keep increasing the verbosity (log level), I think.

Attempting to use ssh items with key authentication fails with :”Public key authentication failed: Callback returned error” Initially I thought Zabbix was broken with regards to ssh public key authentication. I can get it to work with password. I can use my public/private key to authenticate by hand from command-line as root. Turns out running command such as sudo -u zabbix ssh … showed that my zabbix account did not have permissions to write to its home directory (which did not even exist). I guess this is a case of RTFM, because they do go over all those steps in the manual. I fixed up permissions and now it works for me, yeah.

Where should the scripts for external checks go? In my install it is /usr/lib/zabbix/externalscripts.

Why is the behaviour of triggers inconsistent. sometimes the same trigger has expected behaviour, sometimes not. Answer: No idea. Very frustrating. See more on that topic below.

How do you force a web scenario check when you are using templates? Answer: No idea.

Why do (resolved) Problems disappear no matter how you search for them if they are older than, say, 30 minutes? Answer: No idea. Just another stupid feature I guess.

Why does it say No media defined for user even though user has been set up with email as his media? Answer: no idea.

Why do too many errors disable an ssh check so that you get Status Disabled and have no graceful way to recover? Answer: no idea. It makes sense that Zabbix should not subject itself to too many consecutive errors. But once you’ve fixed the underlying problem the only recovery I can figure is to delete the item and recreate it. or delete the host and re-create it. Not cool.

I heard dependent items are the way to go to parse complex data coming out of a rich text item. How do you do that? Answer: Yes they are. I have gotten them to work and really give me the fine-grained control I’ve always wanted. I hope to show a real-life example soon. To get started creating a dependent item you can right-click on the dots of an item, or create a new item and choose type Dependent Item.

I am looking at Latest data and one item is grayed out and has no data. Why? Answer: almost no idea. This happens to me in a dependent item formed by a regular expression where the regular expression does not match the content. I am trying to make my RegEx more flexible to match both good and error conditions.

Why do my dependent items, when running a Check Now, say Cannot send request: wrong data type, yet they are producing data just fine when viewed through Latest data? Answer: this happens if you ran a Check Now on your template rather than when viewing an individual host. Make sure you select a host before you run Check Now. Actually, even still it does not work, so final answer: no idea.

Why do some regular expressions check out just fine on regex101.com yet produce a match for value of type “string”: pattern does not match error in Zabbix? Answer: Some idea. Fancy regular expressions do not seem to work for some reason.

Every time I add an item it takes the absolute maximum amount of time before I see data, whether or not I run check Now until I turn blue in the face. Why? Answer: no idea. Very frustrating.

If the Zabbix server is in one time zone and I am in another, can I have my view of timestamps customized to my time zone? Otherwise I see all times in the timezone of the Zabbix server. Answer: You are out of luck. The suggestion is to run two GUIs, one in your time zone.

My DNS queries using net.dns don’t do anything. Why? Answer: no idea. Maybe your host is not running an actual Zabbiox agent? That’ll do it. Forget that net.dns check if you can’t install an agent. Zabbix has no agentless DNS monitor for some strange reason.

What does Check Now really do? Answer: essentially nothing as far as I tell. It certainly doesn’t “check now”, i.e., force the item to be run.

I have a lot of hosts I want to add to a template. Does that Mass Update feature actually work? Answer: yes. Use it. It will save you time.

Help! I accidentally deleted an entire template. I meant to just delete one of its macros. Is there a revert? Answer: it doesn’t look like it. Hope you remember what you did…

It seems if I choose units in an item which have too many characters, e.g., client connections, the graph cuts it off and doesn’t even display the scale? Answer: seems so. I’d call it a bug.

A word about SSH checks and triggers
Through the school of hard knocks I have learned that my ssh check is clipping the output from the executed command. So you know that partial data you see when you look at latest data, and thought it was truncating it for display purposes? Nuh, ah. That’s all you’re getting to go up against in your trigger, which sucks. It’s something like 260 characters. I got lucky in a sense to discover this early by running an ssh check against dns resolution of amazon.com. The response I got varied almost every 60 seconds depending on whether or not the response came out of the dns cache. So this was an excellent testbed to learn about the flakiness of triggers as well as waste an entire day.

Another thing about triggers with a regex. As far as I can tell the logic is reversed. So you think you’re defining the OK condition when you seek to match the output and have it given the value of 1. But instead try to match the desired output for the OK condition, but assign it a value of 0. I guess. Only that approach seems to work for me. And getting the regex to treat multiple lines as a unit was also a little tricky. I think by default it favored testing only against the last line.

So let’s say my output as scraped from Monitoring|Latest Data alternated between either

proxy1>test dns amazon.com
Performing DNS lookup for: amazon.com
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  205.251.242.103
  176.32.98.166
  176.32.103.205
Cache TTL: 1, cache HIT
DNS Resolver Response: Success

or

proxy1>test dns amazon.com
Performing DNS lookup for: amazon.com
 
Sending A query for amazon.com to 192.168.135.145.
 
Sending A query for amazon.com to 8.8.8.8.
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  20

, then here is my iregexp expression which seems to do the correct thing (treat both of these outcomes as successes):

{proxy1:ssh.run[resolve DNS,1.2.3.4,22,utf-8].iregexp("(?s)((205\.251\.|176\.32\.)|Sending A query.+\s20)")}=0

Note that the (?s) at the beginning helps, I think, to treat the newline character as just another character which matches “.”. I may have an extra set of parentheses around the outermost alternating expression, but I can only experiment so much…

I ran various tests such as to change just one of the numbers to make sure it triggered.

I now think I will get better, i.e., more complete, results if I make the item of type textrather than character, at least that switch definitely helped with another truncated output I was getting from another ssh check. So, yes, now I am capturing all the output. So, note to self, use type text unless you have really brief output from your ssh check.

So with all that gained knowledge, my simplified expression now reads like this:

{proxy:ssh.run[resolve a dns name,1.2.3.4,22,utf-8].iregexp("(205\.251\.|176\.32\.)")}=0

Here’s a CPU trigger. From a show status it focuses on the line:

CPU utilization: 29%

and so if I want to trigger a problem for 95% or higher CPU, this expression works for me:

CPU utilization:\s+([ 1-8]\d|9[0-4])\%

A nice online regular expression checker is https://regexr.com/

And a very simple PING test ssh check item, where the expected resulting line will be:

5 packets transmitted, 5 packets received, 0% packet loss

– for that I used the item wizard, altered what it came up with, and arrived at this:

(({proxy:ssh.run[ping 8.8.8.8,1.2.3.4,22,utf-8].iregexp("[45] packets received")})=0)

So I will accept the results as OK as long as at most one of five packets was dropped.

To be continued…

References and related
A good commercial solution for infrastructure monitoring: Microfocus SiteScope.

DIY monitoring

The Zabbix manual

A nice online regular expression (RegEx) checker is: https://RegExr.com/.

Another online regular expression checker is: https://RegExr.com/.

This entry was posted in Admin, Network Technologies and tagged , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *