Monitoring by Zabbix: a working document

Intro
I panned Zabbix in this post: DIY monitoring. But I have compelling reasons to revisit it. I have to say it has matured, but there remain some very frustrating things about it, especially when compared with SiteScope (now owned by Microfocus) which is so much more intuitive.

But I am impressed by the breadth of the user base and the documentation. But learning how to do any specific thing is still an exercise in futility.

I am going to try to structure this post as a problems encountered, and how they were resolved.

Current production version as of this writing?
Answer: 4.4

Zabbix Manual does not work in Firefox
That’s right. I can’t even read the manual in my version of Firefox. Its sections do not expand. Solution: use Chrome

Which database?
You may see references to MYSQL in Zabbix docs. MYSQL is basically dead. what should you do?

Zabbix quick install on Redhat

Answer
Install mariadb which has replaced MYSQL and supports the same commands such as the mysql from the screenshot. On my Redhat instance I have installed these mariadb-related repositories:

mariadb-5.5.64-1.el7.x86_64
mariadb-server-5.5.64-1.el7.x86_64
mariadb-libs-5.5.64-1.el7.x86_64

Terminology confusion
what is a host, a host group, a template, an item, a web scenario, a trigger, a media type?
Answer

Don’t ask me. When I make progress I’ll post it here.

Web scenario specific issues
Can different web scenarios use different proxies?
Answer: Yes, no problem. In really old versions this was not possible. See web scenario screenshot below.

Can the proxy be a variable so that the same web scenario can be used for different proxies?
Answer: Yes. Let’s say you attach a web scenario to a host. In that host’s configuration you can define a “macro” which sets the variable value. e.g., the value of HTTP_PROXY in my example. I think you can do the same from a template, but I’m getting ahead of myself.

Similarly, can you do basic proxy auth and hide the credentials in a MACRO? Answer: I think so. I did it once at any rate. See above screenshot.

Why does my google.com web scenario work whereas my amazon.com scenario not when they’re exactly the same except for the URL?
Answer: some ideas, but the logging information is bad. Amazon does not take to bots hitting it for health check reasons. It may work better to change the agent type to Linux|Chrome, which is what I am trying now. Here’s my original answer: Even with command-line curl I get an error through this proxy. That can’t be good:
$ curl ‐vikL www.amazon.com

...
NSS error -5961 (PR_CONNECT_RESET_ERROR)
* TCP connection reset by peer
* Closing connection 1
curl: (35) TCP connection reset by peer


My amazon.com web scenario is not working (status of 1), yet in dashboard does not return any obvious warning or error or red color. Why? Answer:
no idea. Maybe you have to define a trigger?

Say you’re on the Monitoring|latest data screen. Does the data get auto-updated? Answer: yes, it seems to refresh every 30 seconds.

Why is the official FAQ so useless? Answer: no idea how a piece of software otherwise so feature-rich could have such a useless FAQ.

Zabbix costs nothing. Is it still actively supported? Answer: it seems very actively supported for some reason. Not sure what the revenue model is, however…

Can I force one or more web scenarios to be run immediately? I do this all the time in SiteScope. Answer: I guess not. There is no obvious way.

Suppose you have defined an item. what is the item key? Answer: no idea. But I’m seeing it’s very important…

What is the equivalent to SiteScope’s script monitors? Answer: Either ssh check or external check.

How would you set up a simple PING monitor, i.e., to see if your host is up? Answer: Create an item as a “simple check”, e.g., with the name ping this host, and the key icmpping[{HOST.IP},3]. That can go into a template, by the way. If it succeeded it will return a 1.

I’ve made an error in my script for an external check. Why does Latest data show nothing at all? Answer: no idea. If the error is bad enough Zabbix will disable the item on you, so it’s not really running any longer. But even when it doesn’t do that, a lot of times I simply see no output whatsoever. Very frustrating.

Why is the output from an ssh check truncated, where does the rest go? Answer: no idea.

How do you increase the information contained in the zabbix server log? Let’s say your zabbix server is running normally. Then run this command: zabbix_server ‐R log_level_increase
You can run it multiple times to keep increasing the verbosity (log level), I think.

Attempting to use ssh items with key authentication fails with :”Public key authentication failed: Callback returned error” Initially I thought Zabbix was broken with regards to ssh public key authentication. I can get it to work with password. I can use my public/private key to authenticate by hand from command-line as root. Turns out running command such as sudo -u zabbix ssh … showed that my zabbix account did not have permissions to write to its home directory (which did not even exist). I guess this is a case of RTFM, because they do go over all those steps in the manual. I fixed up permissions and now it works for me, yeah.

Where should the scripts for external checks go? In my install it is /usr/lib/zabbix/externalscripts.

Why is the behaviour of triggers inconsistent. sometimes the same trigger has expected behaviour, sometimes not. Answer: No idea. Very frustrating. See more on that topic below.

How do you force a web scenario check when you are using templates? Answer: No idea.

Why do (resolved) Problems disappear no matter how you search for them if they are older than, say, 30 minutes? Answer: No idea. Just another stupid feature I guess.

Why does it say No media defined for user even though user has been set up with email as his media? Answer: no idea.

A word about SSH checks and triggers
Through the school of hard knocks I have learned that my ssh check is clipping the output from the executed command. So you know that partial data you see when you look at latest data, and thought it was truncating it for display purposes? Nuh, ah. That’s all you’re getting to go up against in your trigger, which sucks. It’s something like 260 characters. I got lucky in a sense to discover this early by running an ssh check against dns resolution of amazon.com. The response I got varied almost every 60 seconds depending on whether or not the response came out of the dns cache. So this was an excellent testbed to learn about the flakiness of triggers as well as waste an entire day.

Another thing about triggers with a regex. As far as I can tell the logic is reversed. So you think you’re defining the OK condition when you seek to match the output and have it given the value of 1. But instead try to match the desired output for the OK condition, but assign it a value of 0. I guess. Only that approach seems to work for me. And getting the regex to treat multiple lines as a unit was also a little tricky. I think by default it favored testing only against the last line.

So let’s say my output as scraped from Monitoring|Latest Data alternated between either

proxy1>test dns amazon.com
Performing DNS lookup for: amazon.com
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  205.251.242.103
  176.32.98.166
  176.32.103.205
Cache TTL: 1, cache HIT
DNS Resolver Response: Success

or

proxy1>test dns amazon.com
Performing DNS lookup for: amazon.com
 
Sending A query for amazon.com to 192.168.135.145.
 
Sending A query for amazon.com to 8.8.8.8.
 
DNS Response data:
Official Host Name: amazon.com
Resolved Addresses:
  20

, then here is my iregexp expression which seems to do the correct thing (treat both of these outcomes as successes):

{proxy1:ssh.run[resolve DNS,1.2.3.4,22,utf-8].iregexp("(?s)((205\.251\.|176\.32\.)|Sending A query.+\s20)")}=0

Note that the (?s) at the beginning helps, I think, to treat the newline character as just another character which matches “.”. I may have an extra set of parentheses around the outermost alternating expression, but I can only experiment so much…

I ran various tests such as to change just one of the numbers to make sure it triggered.

I now think I will get better, i.e., more complete, results if I make the item of type textrather than character, at least that switch definitely helped with another truncated output I was getting from another ssh check. So, yes, now I am capturing all the output. So, note to self, use type text unless you have really brief output from your ssh check.

So with all that gained knowledge, my simplified expression now reads like this:

{proxy:ssh.run[resolve a dns name,1.2.3.4,22,utf-8].iregexp("(205\.251\.|176\.32\.)")}=0

And a very simple PING test ssh check item, where the expected resulting line will be:

5 packets transmitted, 5 packets received, 0% packet loss

– for that I used the item wizard, altered what it came up with, and arrived at this:

(({proxy:ssh.run[ping 8.8.8.8,1.2.3.4,22,utf-8].iregexp("[45] packets received")})=0)

So I will accept the results as OK as long as at most one of five packets was dropped.

To be continued…

References and related
A good commercial solution for infrastructure monitoring: Microfocus SiteScope.

DIY monitoring

The Zabbix manual

Posted in Admin, Network Technologies | Tagged , | Leave a comment

Consumer tech: Samsung touchscreen not responsive

Intro
This may be really obvious to some, but I did not realize until recently that you can adjust the touch screen sensitivity on a Samsung smartphone. Who knew?

Why it matters
I have a family member with a glass screen protector. I watched her struggle tapping and swiping multiple times to get it to respond. No one mentioned a way to fix it. Then i helped apply a new glass screen protector after the old one broke and i did an RTFM. The directions mentioned you can increase the touch sensitivity in settings. It’s no longer where it said. but in Display settings.

Posted in Consumer Tech | Leave a comment

Getting GNU screen to work on Windows 10 for a productive terminal multiplex environment

Intro
My jump server is getting old and they’re threatening to cut it off. A jump server is a server from which you launch CLI terminal sessions into your linux servers. Since my laptop has firewall access to all the same servers I wondered if I could build up a productive environment right within Windows 10 on my own laptop. For me this would be running GNU screen as a terminal multiplexer since I hop between terminal screens all day.

More details
Windows 10 is coming around to more fully integrating with Linux! it’s about time. WSL, windows subsystem for Linux, is all about that. And things like bash shell, ubuntu and OpenUSE Linux are available from the windows store. But that was not an option for me. My organizaiton has shut all that down.

So I thought back to my days as a Cygwin user those many years ago… Could I get GNU screen running within Cygwin environment on Windows 10? Well, yes, I can with just a few tweaks.

I think the initial Cygwin install required admin privileges, but once installed to run it does not.

Within Cygwin screen is an optional package and you can run their setup program to search and install it.

Here is my .screenrc file

defscrollback 4000
#change init sequence to not switch width
termcapinfo  xterm Z0=\E[?3h:Z1=\E[?3l:is=\E[r\E[m\E[2J\E[H\E[?7h\E[?1;4;6l
 
# Make the output buffer large for (fast) xterms.
termcapinfo xterm* OL=10000
 
# tell screen that xterm can switch to dark background and has function
# keys.
termcapinfo xterm 'VR=\E[?5h:VN=\E[?5l'
termcapinfo xterm 'k1=\E[11~:k2=\E[12~:k3=\E[13~:k4=\E[14~'
termcapinfo xterm 'kh=\E[1~:kI=\E[2~:kD=\E[3~:kH=\E[4~:kP=\E[H:kN=\E[6~'
 
# special xterm hardstatus: use the window title.
termcapinfo xterm 'hs:ts=\E]2;:fs=\007:ds=\E]2;screen\007'
 
#terminfo xterm 'vb=\E[?5h$<200/>\E[?5l'
termcapinfo xterm 'vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l'
 
# emulate part of the 'K' charset
termcapinfo   xterm 'XC=K%,%\E(B,[\304,\\\\\326,]\334,{\344,|\366,}\374,~\337'
 
# xterm-52 tweaks:
# - uses background color for delete operations
termcapinfo xterm ut
 
#from https://stackoverflow.com/questions/359109/using-the-scrollwheel-in-gnu-screen
termcapinfo xterm* ti@:te@
 
escape ^\\ # changes espace sequence
password

Note that in my .screenrc I use <Ctrl-\> as my escape sequence, so, e.g., to pop to the previous screen it is <Ctrl-\> <Ctrl-\>. I’m not sure that’s standard but my fingers will remember that to my dying day. They probably still remember some of those EDT/TPU VAX editor commands to this day!

Compare and contrast
Here are my day 0 observations.

ssh, curl, nslookup and tracert are coming from the underlying Windows system (do a which curl to see that) so that means you get the dumb version your system has.

So there is no dig, and no nc or netcat.

touch, cat, mkdir and vi behave pretty normally. man pages are installed, which can be a help.

if you use proxy, a funny thing can happen and your environment variables can get mixed. You may have inherited an HTTP_PROXY environment variable form the system, but the alias you copied from a linux jump server probably defines an http_proxy environment variable (lower case). And both can co-exist! As to which one curl would then use, who knows? Better just stick to working with the upper-case one and NOT define another in lower case.

For awhile it looked like scrolling was not working at all when screen was running. Then i found that tip I reference at the bottom of my .screenrc file which makes scrolling work via the mouse’s scroll wheel, which isn’t too bad.

Old friends like ls, grep, echo and while (built-in bash command) are available however. dig can be installed from the bind-utils package.

A lot of other packages are optionally available, including a whole X-Windows environment, which I used to run in the past but hope to avoid this time around.

No crontabs however (to have cron daemon requires installing admin privileges) which kind of hurts.

Simple output redirection seems to work, as does job control, e.g.,

ping -t 8.8.8.8 > /dev/null 2>&1 & 

Not sure why you’d want to run the above command, but this nice example shows that the /dev/null device exists, and the ping command is inherited from your Windows environment hence the -t option to run it indefinitely, and that it will create a background process which you can view and control with jobs / kill.

Now I typically move my laptop off the work environment each night, so all my ssh logins will be lost, unlike the jump server situation. But our jump server isn’t that stable anyway so no big loss I’d say…

I am sooo used to highlighting text in Teraterm, which is my current environment, and that being sufficient to put that text into the clipboard, that I keep doing that in this environment. But it doesn’t work. I have to use the CMD window convention of highlighting the text and then hitting ENTER to get it into the clipboard. oops. That was because I had been launching Cygwin from a CMD window. Now I am launching from a proper Cygwin shortcut and simple text highlighting works, BUT, right-clicking to paste it in brings up a menu rather than just doing it! So there’s that different now…

A word on package management
I don’t know why I was afraid of installing packages when I first tried Cygwin over a decade ago. Now for me that’s the key – to understand and practice installing packages because it’s actually really easy when you’re used to it.

The key is to simply keep your initial install setup hanging around, setup-x86_64.exe. In my case it’s in my downloads directory. Example usage: I wondered if I could install a decent version of ping rather than continually suffer with the dumb DOS version. So, fire up the above-mentioned executable. Go through a few screens (where it remembers the answers from the initial install), then search for the package (Yes, it’s there!), and select to install the most recent version from the drop-down. A few more clicks and it’s done and available in your path. it’s that easy… Not sure about uninstalling because you almost never need to do that. It seems maybe a thousand packages are available? so no, there’s no yum or zypper or rpm or apt-get, but who really needs those anyway?

As a concrete example, I am learning about SNMP. So I got something running on a Bluecoat proxy, and I wanted to see what I could see. The guide recommended using snmpwalk, which of course I did not have. So I learned which package it is in with a DDG search, then ran the Cygwin setup, found that package, installed it, and voila, there was snmpwalk in my path. And it worked, by the way. Easy peasy.

Creating your own scripts

If you have the funny situation, like me, where you had enough privileges to install Cygwin, perhaps by temporarily assigning your account the Admin role, but when you use it day-to-day, you do not have admin privileges, you will find yourself unable to create files in some of the system directories like /usr/local/bin – permission denied! But in your home directory you will be able to edit files.

So what I did is to create a bin directory under my home directory, where I plan to add my home-grown scripts such as mimeencode, and make sure my PATH includes this directory with a statement like

 export PATH=$PATH:${HOME}/bin

which I put in my .alias file, which in turn I source from .bashrc.

Conclusion
GNU screen for Windows is indeed possible, but you gotta run it on top of Cygwin. It’s of interest that after all these years Cygwin is still viable o Windows 10. Cygwin can be run in a pretty lightweight fashion if you avoid the X-Windows stuff. There are some quirks but it is surprisingly linux-like at the end of the day. I believe it is really suitable as a replacement for a linux jump server. screen, for the uninitiated, is a temrinal multiplexer, which means it makes it very fast for you to switch between multiple terminal windows.

Some things are a bit different.

I think I will use this both at work and at home…

If you have access, a look at WSL and native bash might be worthwhile.

References and related

Here’s the GNU Cygwin home page: https://www.cygwin.com/

Install Cygwin by running https://www.cygwin.com/setup-x86_64.exe

Interesting discussion: https://stackoverflow.com/questions/359109/using-the-scrollwheel-in-gnu-screen

If you have a linux jump server that runs screen, or just want to ssh to a linux server, teraterm can be a good choice (as opposed to putty or built-in ssh). These days it can be found here: https://osdn.net/projects/ttssh2/releases/

Posted in Admin, Linux | Tagged , , , , , | Leave a comment

Azure Cloud: can you swap public IP addresses on two VMs?

Intro
I am just beginning to use Microsoft’s Azure cloud environment. Although I am inclined to be a fan of AWS, I haven’t looked at AWS networking for awhile, and the last time I did something I felt totally lost in trying to understand their terminology.

But in spite of my natural inclination to support everything Amazon, I gotta admit that Azure was a good, usable environment for what I needed to do; swap the public IPs on two VMs.

The details
I was not getting any help whatsoever from with my organization. But I did at least get sufficient access to the Resource Group where my Redhat 7.4 VM was running. That was a godsend.

In Azure network interfaces are resources. They have IPs like 10.0.1.4, 10.0.1.7, etc.

Public IP addresses are resources. They have IPs apprporiate for your region. They are typically associated with a network interface.

A network interface in turn is associated to a VM, typically.

For some reason which no one could explain to me, I could no longer patch my RHEL 7.4 server. That began about September 2019. Meantime, I was using an application which relied on a built-in package. Now Redhat always ships with old versions of everything, so running this old version plus lack of patches really put pressure on me to upgrade to a new OS. Can you do an in-place upgrade? As far as I can tell, no. I went with SLES 15 SP1 on a new VM within the same resource group and data center since I have some familiarity with Suse Linux. That OS had a newer version of that open source package, plus it could be patched.

But the IP of the old server was embedded in several places and switching it was not an option. What to do? What to do? Can you even swap IPs on two VMs within the same Resource Group? Who knows?

Well, it turns out you can. The documentation on the topic is pretty good and cleared up some things for me. Particularly the first two links in the references at the bottom.

I took this approach.

Changed the IP from dynamic to Static (go to configure section when looking at this resource). This should have been done from the get-go, but wasn’t. Who knew?

Dissociate the IP from the network interface.

Changed the second IP from dynamic to static.

Dissociate this IP from its network interface.

Associate IP to network interface of the SLES 15 server.

Associate second IP to network interface of the RHEL server.

And that’s it…. It worked like a charm.

Then I cleaned up some old public IP addresses which weren’t being used. You have to remember there is a shortage of IPs. So a lot of the quirk you encounter are due to their utilizing Ips as sparingly as possible. makes sense to me. For instance you can have a “public IP” resource which has no value! it may not get a value until its absolutely needed by virtue of being associated with a network interface on an active server. Stuff like that…

Conclusion
Yes, you can indeed swap public IPs on two servers if they belong to the same Resource Group and I guess the same data center. I know because I did it. As a bonus I found that the Azure documentation is pretty clear and sufficiently detailed.

References and related
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-ip-addresses-overview-arm
https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-network-interface-addresses

Posted in Admin | Tagged , , , , | Leave a comment

The IT Detective Agency: the case of the Is it the firewall? or routing? or switch? or layer 2?

Intro
This is yet another tale of things in the IT world often do not turn out the way it seems at first blush. Or possibly a tale of just when you think you’ve seen it all after decades in the industry, something new (to you) occurs.

What’s going on
The firewall team was all busy so when this strange problem occurred Friday they called in the second string: me. I consider some of the team to be less-than-customer focused so I try to compensate for them and for my lack of knowledge about the firewall by applying a more customer-first attitude. In other words, a sympathetic listening ear. These days it can be hard just to find someone to complain to about your It problem, and I am keenly aware of that.

There was some strange communication which wasn’t working, mediated by a firewall I had never accessed and was not sure i even had access to. So of course I was asked to join a big conference call where an ongoing debugging session was taking place.

I refused.

I hate being blindsided, and i hate not having answers, making me sound even less competent than I already am.

But what I did do is being my research to see what the system is, if perhaps I had access, etc.

Yes. I found that through a management system I have access to I had access to view the policies on that particular firewall and view the logs as well.

So once I had that up, I agreed to join the call.

They had one server communicating to three different systems. Only one of the three systems was being reached. Yes the other two were on the same subnet. Two of our firewalls were between the system and the three servers.

And, yes, i could see some drops. The interesting TCP error stated: TCP packet out of state, first packet isn’t SYN.

No problem. routing must be screwed up such that we have asymmetric routing. It happens all the time. Right? well these systems are really appliances with only some basic networking information configurable, not real debugging facility, and really no ability to add a host route.

I could not establish a shell session onto the firewall – not sure what the password naming scheme was that they used.

Then a real firewall guy comes on the call. But his connectivity is messed up, so I keep with the debug session, if nothing else than to support him since four eyes is more effective than just two. He shares the routing tables of our two in-line firewalls. It’s hard to understand as these are all new subnets for me, some are ones that don’t look right. But just focusing on possible host routes for any of these three servers, I don’t see anything amiss.

Firewall policy
And, in firewall policy I see the entire subnet has this traffic permitted. There is no rule specific to one or the other of these systems.

So what do we have up until now?
A purist firewall administrator attitude would be as follows:
The firewall treats all these systems the same, therefore this cannot be a firewall problem. Talk to your networking or system people. Have a nice day.

Well, in fact there was some serious question about the network switch as well. So we had a network guy on the call. So they dug up the MAC addresses of these systems, from which they found the switch ports. Then they checked the port configuration. Ah, some complex 802.1x authentication was configured. As I understand this means the device would not even be allowed onto the subnet until it passed some kind of Radius authentication. So they removed this 802.1x stuff and just made sure that port was assigned to the right vlan.

Still, the problem persisted.

I think the other firewall guy was also new to this equipment. Eventually, though, he tries to do a packet trace of the one that’s working versus the one that isn’t.

You know, I never saw the results of those traces, but I’m pretty sure, reading between the lines, that they surprised him, meaning, they did not fit the hypothesis of the asymmetric routing.

In these situations there is the main communication in the mian session, then side communications going on, like between me and the firewall guy. But it is all chaotic. Acoustics are mediocre, accents are hard to understand. So the net transfer of information is pretty low. Statements, even important ones, often have to be repeated multiple times (rebroadcasts) to assure everyone “gets it.”

Typical questions were asked. When did this last work? what had changed? There were a couple changes. Some kind of networking thing (I forget what), and then the firewalls changed management systems after that. The firewall change seemed closer in time to the last known success.

You acquire more and more information as you dig into problems. It’s hard to judge which is relevant at the time and which lines if inquiry are a complete waste of time. A good incident manager or project manager can sense which are the more productive lines of investigation and nurture those discussions while suppressing the noise.

Actually it was the networking guy who found the Checkpoint link below. I looked at it. the firewall guy was muttering something about badly behaved, older applications that might exhibit this behaviour.

So we agreed to take the suggested steps, which would basically allow these out of state packets. Drat. The firewall returned an error.

But I continue to refresh the firewall logs. The communication was occurring about every minute. Lo and behold, I see the older drops, and then accepts for the last few minutes! I think it worked. I tell them to check.

They check their end. Sure enough. Communication beginning to work…

The customer tries to make assertion that this was a firewall problem all along. Not so fast. Firewall guy says, well, the firewall is doing exactly what it’s supposed to be doing. who’s right?

We’re all good for now, but we state this is a kludge for today and a follow-up meeting needs to occur.

So what happened?
I think the single most important thing is that the firewall guy switched his problem hypothesis from Must be asymmetric routing, to Maybe it’s a badly behaved application. Meaning what? What if you have an application that establishes a TCP connection, and then to beat idle timeouts, sends a KEEP ALIVE packet every minute? Well, now, suppose your firewall is rebooted in the middle of that because it has changed management stations and needs to reload policy? What might the situation look like to it?

It you were unlucky, it just might see these KEEP ALIVE TCP packets without having the connection in its connection table, in other words, exactly the situation we are observing!

What should have happened?
It would have been great if the communication were forced to be re-established form time-to-time, even once a day. This problem had been going on for days.

But, given this very stupid behaviour on the part of this application, if the app people had been aware they should have forced their application to re-establish the TCP connection after the firewall reboot. Probably, for the one that did work, it had been forced to re-establish.

A firewall person has to be sufficiently aware to realize this could be happening, and advise the app owner on what to do to prevent it.

Conclusion
So whose problem is it?

To the app people it looks like a firewall issue, cut-and-dried. To a firewall guy it looks like an application issue, cut-and-dried. I see both sides. It is some of both. An app owner has to understand enough about firewalls to see that this type of thing can occur. Assigning blame to one side or the other, as most people are wont to do, is not productive. Only a team effort could have revealed this issue. And recall that the “fix” is actually a kludge that lowers security.

Case: almost closed.

References and related
Checkpoint’s note on TCP packet out of state first packet isn’t SYN: https://community.checkpoint.com/t5/General-Topics/TCP-packet-out-of-state-First-packet-isn-t-SYN-tcp-flags-SYN-ACK/td-p/37166

The IT Detective agency cases are still coming fast and furious. Here’s another recent case. Failed to convert character

Posted in Admin, Network Technologies | Tagged | Leave a comment

The IT Detective Agency: WebEx and the case of the mysterious reset

Intro
A company known to me was a contented user of WebEx until they noticed a strange behaviour: their calls were losing quality or even dropped exactly after one hour. No one, most especially the vendor of WebEx, had the slightest idea of the root cause. Read on to see how this fascinating case is playing out.

Scene: company offices, Sao Paolo, Brazil
Triago, a very competent IT professional located in Sao Paolo was the first to report the problem. More-or-less it went like this:

– when he uses the call-my-computer feature in WebEx the call quality is fine, until he has been on the meeting for one hour. At the one hour mark the voice quality of others (from his perspective) dropped dramatically. Sometimes the call was completely lost. Then, about five minutes later, the quality was OK again.
– the problem only occurred when in the office or using VPN, i.e., when using the company network
– others in South America are having the same issue
– he can use the same company laptop, on the Internet, and will not have the problem

After the usual finger-pointing amongst various vendors a debugging plan was created.

It’s going really well. There have been about 12 test calls, stretched out over the last five months. You have to admire the chutzpah of US software vendors who sell to major customers and then still manage to treat them like crap come time for support.

The pattern, more-or-less, goes like this. test call with several vendors plus Triago and I in the US. Wait around for an hour, produce the problem. Wait for software vendor to “analyze”. Wait for two weeks. Some small insight may be gleaned by them. Conclusion: another test is needed, we didn’t have all the traces we need. rinse and repeat.

Scene: A soulless office park somewhere in northern New Jersey
To be continued, literally…

Scene: an enterprise-class server room somewhere in Research triangle Park, North Carolina
If one uses the company’s guest WiFi, one uses the company’s firewall, but not the company’s proxy server. This test succeeds. But, unfortunately, one also uses UDP rather than TCP for the communication because that is the default. See the references for communication requirements.

So one thought is to knock out the ability to use UDP by blocking UDP port 9000, thereby forcing use of TCP. Testing that today….

References and related
Networking requirements for WebEx: https://help.webex.com/en-us/WBX264/How-Do-I-Allow-Webex-Meetings-Traffic-on-My-Network

Posted in Admin, Network Technologies | Tagged | Leave a comment

What I’m Working on Now: building my own 3D printer

Intro
Because someone else on the team had a good experience with it, and so i can exchange notes and get some tips form them, I went ahead and ordered the Anet A8 3D printer + an extra roll of PLA. That is, despite the vocal negative reviews which exist.

Although normally I don’t consider myself very mechanical, I guess I’m pretty good at following instructions. So I’m watching their detailed Youtube video and mimicking their actions, step-by-step. And I’m having a blast! It’s like having an erector set all over again.

And I’m in constant awe that so much could be had for so little. There must be, what, a couple hundred parts, including five stepper motors, several limit switches, a circuit board, lcd, power supply, steel rods, belts, arylic(?) parts, and 0.5 Kg of PLA – all for $140?? I feel like its raw value is several times that. Hey, I just bought two m5 screws and nuts from Home Depot and paid almost $3.

Not exactly like the video

Things were going great, until the kit actually deviated form the parts shown in the video! Especially where the kit provided a two 3D-printed parts shaped significantly differently from what they show. Assembly slowed down after that.

The rods did not fit through the 3D parts as they should have – I had to bore them out a bit with one of the yet-unused threaded rods that came with the kit! Which did work by the way + a lot of muscle and hand strength.

Partially assembled – extruder not yet inserted.

Many gray hairs later I finished the assembly. If you’re familiar with furniture assembly of a semi-complex piece like a desk with top shelves, this is about three times more complicated. You keep hoping that OK, now it’ll get easier and I’ll cruise through this. But it never does! Every stage features unique steps presenting their own unique challenges.

In my case my cooling fan on the extruder, which looks suspiciously like a 3D part, hangs below the extruder nozzle, so I don’t think I can use it.

Then once it was assembled my first test print, a simple box, went awry. Around the third layer the whole thing started moving around on the print bed! Once again I was hoping that the hard part was the assembly and then I could cruise to printing to my heart’s content. Wrong. Now comes a whole new set of challenges unique to 3D printing, and you have to master that as well. This is nothing at all like going to Staples to get an ink jet printer where the most challenging thing you’ll have to do is change an ink cartridge. Nothing at all like that. This is more like constructing a model railroad yourself.

Things get smoother

So I took everyone’s advice and installed Cura 3D as my slicer. Then I just decided to go for it and print out my first upgraded part: a center nozzle fan. It worked really, really well!

My very first print – a center nozzle fan

You can almost see form the picture that the quality was really good. I had to sandpaper the chute a little for it to fit – apparently ABS sandpapers easily – and voila, it snapped into place.

Next I printed a filament guide. Also no problems there.

Then my filament broke.

Some terms
slicer – software which takes an STL file an translates it into a series of layer-by-layer movements. Cura 3D is the slicer I use.
gcode – the file format that the anet a8 printer understands. You take an STL file, put it into your slicer, and have it produce a gcode file for you that you print.
PLA – the type of plastic most often used for 3D printing at home

References and related
Amazon link to what I purchased – now only $135! https://smile.amazon.com/gp/product/B01N5D2ZIB/ref=ppx_yo_dt_b_asin_title_o01_s00?ie=UTF8&psc=1

Assembly video, part 1
Part 2.
Anet test guide: part 3: https://www.youtube.com/watch?v=EB5Q3_sJ-Tk

Someone’s additional first steps guide – has some useful tips: https://www.instructables.com/id/Beginners-Guide-to-3D-Printing-Anet-A8-DIY-3D-Prin/

Note that the included microSD card has PDFs for assembly instructions, usage instructions and troubleshooting guide.

Upgrade part: center nozzle fan download: http://www.thingiverse.com/thing:1620630

I am contemplating using openscad to build my 3D models. www.openscad.org. This is really good tutorial: http://www.tridimake.com/2014/09/how-to-use-openscad-tricks-and-tips-to.html

Posted in Admin | Tagged | Leave a comment

The IT Detective Agency: the case of Failed to convert character

Intro
A user of a web form noticed any password that includes an accented character is rejected. He came to use as the operator of the web application firewall for a fix.

More details
The web server was behind an F5 device running ASM – application security manager. The reported error that we saw was Failed to convert character. What does it all mean?

One suggestion is that the policy may have the wrong language, but the application language of this policy is unicode (utf-8), just like all our others we set up. And they don’t have any issues. I see where I can remove the block on this particular input violation, but that seems kind of an extreme measure, like throwing out the baby with the bathwater.

I wondered about a more granular way to deal with this?

Check characters on this parameter value is already disabled I notice, so we can’t further loosen there.

Ask the expert
So I ask someone who speaks a foreign language and has to deal with this stuff a lot more than I do. He responds:

Looking at the website I think that form just defaults to ISO-8859-1 instead of UTF-8 and that causes your problem.
Umlauts or accented letters are double byte encoded in UTF-8 and single byte in ISO-8859-1

To confirm the problem with the form, he enters an “ä” as the username, which the event log shows encoded to %E4 which is not a valid UTF-8 sequence.

Our takeaway
To repeat a key learning from this little problem:
Umlauts or accented letters are double byte encoded in UTF-8 and single byte in ISO-8859-1

So the web form itself was the problem in this case; and I went back to the user/developer with this informatoin.

So he fixed it?
Well, turns out his submission form was a private page he quickly threw together to test another problem, the real problem, when he noticed this particular issue.

So, yes, his form needed to mention utf-8 if he were going to properly encode accented characters, but that did not resolve the real issue, which remains unresolved.

It happens that way sometimes.

But, yes, the problem reported to us was resolved by the developer based on our feedback, so at least we have that success.

Conclusion
If like me, your eyes glaze over when someone mentions ISO-8859-1 versus UTF-8, the differences are pretty stark, easy-to-understand, and, just sometimes, really, important! I think ISO-8859-1 will represent some of the popular accented characters in positions 128 – 255, but not utf-8. utf-8 will use additional bytes to represent characters outside of the Latin alphabet plus the usual special characters.

We’ll call this one Case Closed!

References and related
I like to do a man ascii on any linux system to see the representation of the various Latin characters. I had to install the man-pages package on my RHEL system before that man page was available on my system.

Posted in Network Technologies, Web Site Technologies | Tagged | Leave a comment

How to POST with curl

Intro
For the hard-core curl fans I find these examples useful.

Example 1
Posting in-line form data, e.g., to an api:

$ curl ‐d 'hi there' https://drjohns.com/api/example

Well, that might work, but I normally add more switches.

Example 2

$ curl ‐iksv ‐d 'hi there' https://drjohns.com/api/example|more

Perhaps you have JSON data to POST and it would be awkward or impossible to stuff into the command line. You can read it from a file like this:

Example 3

$ curl ‐iksv ‐d @json.txt https://drjohns.com/api/example|more

Perhaps you have to fake a useragent to avoid a web application firewall. It actually suffices to identify with the -A Mozilla/4.0 switch like this:

Example 4

$ curl ‐A Mozilla/4.0 ‐iksv ‐d @json.txt https://drjohns.com/api/example|more

Suppose you are behind a proxy. Then you can tack on the -x switch like this next example.

Example 5

$ curl ‐A Mozilla/4.0 ‐x myproxy:8080 ‐iksv ‐d @json.txt https://drjohns.com/api/example|more

Those are the main ones I use for POSTing data while seeing what is going on. You can also add a maximum time (-m I think).

POSTman
Just overhearing people talk, I believe that “normal” people use a tool called POSTman to do similar things: POST XML, SOAP or JSON data to an endpoint. I haven’t had a need to use it or even to look into it myself. yet.

Conclusion
We have documented some useful switches in curl. POSTing data occurs when using APIs, e.g., RESTful APIs, so these techniques are useful to master. Roadblocks thrown up by web application firewalls or proxy servers can also be easily overcome.

Posted in Web Site Technologies | Tagged , , , | Leave a comment

No Internet, secure WiFi status message in Windows 10

Intro
Finding out how Windows decides if there is an Internet connection or not can be a challenge often posed by trying to do an Internet search comprised or words that are common and therefore used in many other contexts. I have to give credit to someone else who found most of these pertinent links that help explain how Windows decides whether or not your PC has an Internet connection.

What they don’t tell you
I think there are a lot more tests microsoft does than what they’ve documented. In my opinion, based on observation, in addition to the sites they recommend to whitelist, also whitelist

www.msftconnecttest.com

Some PCs get stuck in a loop requesting www.msftconnecttest.com/connecttest.txt indefinitely, which isn’t good for anyone.

Here’s one they don’t mention, of the same ilk:

ipv6.msftconnecttest.com/connecttest.txt

I’m thinking to just leave that one alone, unless you really are fully running on ipv6.

Now if you have a PAC file, what you’re going to see are accesses for
<PAC-file-address>/connecttest.txt

I don’t think that one’s documented either. I’m not yet sure how best to have the PAC file web server respond, where best means the reply which would make the PC most likely to decide Yes I really do have an Internet connection.

References and related
This Pulse Secure article is pretty good. You start with an Internet connection, then launch Pulse Secure vpn, then find you are told there is no longer an Internet connection. This explains why it might be, but in my opinion it is incomplete as it does not even consider the case where an authenticating proxy is the sole gateway to the Internet:
https://kb.pulsesecure.net/articles/Pulse_Secure_Article/KB43805

These are two more articles about VPN tunneling
https://community.pulsesecure.net/t5/Pulse-Desktop-Clients/Pulse-Secure-blocks-Windows-10-apps-from-internet-access/td-p/11944
https://docs.pulsesecure.net/WebHelp/PCS/8.3R1/Home.htm#PCS/PCS_AdminGuide_8.3/About_VPN_Tunneling.htm

network Location Awareness (NLA) and Network Connection Status Indicator (NCSI) are explained in these articles
https://support.microsoft.com/en-us/help/4494446/an-internet-explorer-or-edge-window-opens-when-your-computer-connects
https://support.microsoft.com/en-us/help/2778122/using-authenticated-proxy-servers-together-with-windows-8

Posted in Admin, IT Operational Excellence, Network Technologies | Tagged | Leave a comment