Unlike your home PC, critical security servers basically never just restart on their own. But today I got a Zabbix notification that one of ours had done just that. I set out to investigate, to learn the root cause and prevent it from recurring. Things took a weird turn at some point.
The details
I asked my colleague to investigate the server. He dutifully looked at various logs, and nothing in them seemed amiss. Then he showed me a screenshot of the CLI prompt. The server had been up for 497 days and some hours. So there was no restart.
We speculated that maybe just the ASM module had restarted, but without any real evidence.
System uptime is an SNMP MIB object which we monitor in Zabbix.
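The 497-day figure is suggestive. SNMP's sysUpTime is a 32-bit TimeTicks value counted in hundredths of a second, so it wraps back to zero after 2^32 ticks, which is a bit over 497 days. A monitor that only sees the value drop can mistake the wrap for a reboot. Nothing above confirms that this is what happened here, so treat the following as a sketch of the arithmetic only. The function names are mine, not anything from Zabbix.

```python
from datetime import timedelta

TICKS_PER_SECOND = 100   # SNMP TimeTicks are hundredths of a second
COUNTER_SIZE = 2 ** 32   # sysUpTime is a 32-bit value


def wrap_interval() -> timedelta:
    """Time it takes a 32-bit TimeTicks counter to wrap to zero."""
    return timedelta(seconds=COUNTER_SIZE / TICKS_PER_SECOND)


def reported_uptime(true_uptime: timedelta) -> timedelta:
    """Uptime as a 32-bit TimeTicks counter would report it."""
    ticks = int(true_uptime.total_seconds() * TICKS_PER_SECOND) % COUNTER_SIZE
    return timedelta(seconds=ticks / TICKS_PER_SECOND)


if __name__ == "__main__":
    print("counter wraps after:", wrap_interval())
    # A box that has really been up 498 days looks freshly booted via SNMP.
    print("498 days is reported as:", reported_uptime(timedelta(days=498)))
```

If the wrap is the explanation, the "restart" would show up only in the SNMP-based monitoring and not in the system's own logs, which is consistent with what we saw.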
This is an actual dream, or more like a nightmare, that I had recently. I guess it’s illustrative of an IT person’s worst fears.
The dream
Well, I don’t remember dreams very well and I’m not the kind to embellish stories to make them sound more interesting, so this is going to be brief.
So in this dream I am at the office. My work situation seems to be that I have slightly more access than I need to various systems. The workplace is a large corporate office where processes are followed, but individual contributors still want to make a difference, so there is some self-imposed pressure to do something of value.
So anyway I find myself in this dream needing to make a configuration change to a monitoring system. Something like a Zabbix implementation. Now I know that I am not chiefly responsible for it, and in fact I should not be modifying it, but I have some idea that what I plan to do will make it better in some way. It’s in fairly widespread usage – about 150 users.
While doing this improvement it asks me for a new administrator password. But it wasn’t exactly that. It changed the administrator password, displayed the new one, and suggested I copy it and save it, which I did. This dialog only displayed for about 10 seconds and that screen went away.
Actually I hadn’t quite had time to save that new admin password; I just had it in my clipboard. I went to Notepad++ to paste it and preserve it. There was nothing in my clipboard! That deep, sinking feeling set in. Even if the system continues to work, we won’t be able to do patches, so it is as good as killed. It will just take a little longer.
I guess we’ve all been there, right?
Inspired by
IRL I was purchasing tickets for Shen Yun. I selected seats and was at the part where you enter credit card info. My Edge browser proposed to enter a generated credit card number, which I generally approve of as an anti-fraud measure, so I let it fill in the info. It needed biometric authentication (face). The next thing I knew the whole browser screen vanished. Edge running on a completely patched Windows 11 PC simply crashed without a word. I think it bears mentioning because, unlike the bad old days when crashes were customary, these days it’s not such a quotidian occurrence. And when I restarted things, there was no memory of my seat selection, yet the seats were blocked from being purchasable. Kind of a worst case scenario there. I didn’t wish to wait for the 20 minute purchase timer to time out as few seats were available. Fortunately there were comparable seats in another row. Second time through it did not propose to fill in with a random card and things went through.
I develop simple 3D objects using an approach which could be called 3D-object-as-code. The language is OpenSCAD, and my few objects are all documented here: 3d printing some parts for the house. But that was started long before generative AI took off. So it was incumbent on me to explore what assistance I could get from ChatGPT.
The details
I wanted an object which would push against our utensil holder to pin it in place within the kitchen drawer, and at the same time make that space available for additional kitchen gadget junk. The final picture is further down below. I started by having ChatGPT generate the base from my desired dimensions. So far so good. Then I had it redo the base using less material by making it a lattice. Things already began to fall apart: it dropped two of the ends, though it did create a lattice pattern.
Now mind you no serious developer would proceed as I have done. I get the code from chatgpt, and paste it into the openscad app to render it to see what it created. Very indirect and inefficient! But this is just for me, so why not…
So anyway, I add the missing side by hand.
A time saver?
Yes! At this point ChatGPT has gotten me started. On my own, I get psyched out by decisions such as whether or not to create centered versions of cubes, and my issue when doing it by hand is getting lost in translation, literally. I often find myself three translations deep, and it gets to be overwhelming to think it through. ChatGPT unilaterally decided the initial cube would not be translated, so I went with that simpler approach, and that helped.
I think I did manage to get chatgpt to add the legs as well, so that was a help.
ChatGPT was good at creating modular and therefore reusable code which even had nice comments! It’s like looking at the code of a colleague who writes better code than you and picking up a few pointers.
But not after a while
I realized I needed to stabilize those legs. But what words to use in the prompt? I’m not an engineer. I said something to this effect:
Starting from this openscad code, add a small cross arm having the same thickness as the leg in order to anchor the leg more firmly (openscad code...)
The result was laughable as it produced a horizontal piece attached to the bottom of the leg on the one end and attached to nothing at all on the other!
Second attempt:
Starting from this openscad code, add a small brace having the same thickness as the leg in order to anchor the leg more firmly. It should begin at (0,0,20) and end at (20,0,15) (openscad code...)
Still, it ignored these very direct start and end directives. I tried once more with no better results.
I also tried an approach requesting to add a bracing triangle of material to help stabilize the leg, but it laughably added an extruded triangle along the whole length of the leg!
At this point the AI was clearly not acting like an assistant, but like a text generator. It clearly had zero idea what it was doing.
So at that point it was a time-negative exercise, useful only for this blog post and to make me humbly admit I do not know how to get the most out of ChatGPT.
Finally
I had to add those bracing bits by hand-coding that part. That involved a rotation, a translation and a difference. It could have been worse.
// DrJ 1/2025. Parameters in mm
// Dimensions of the box
width = 110;
length = 150;
thickness = 3;
epsilon = 1;       // Overshoot in y so the shaving cut passes cleanly through
brace_angle = 30;  // Tilt of the brace about the y axis, in degrees
brace_z = 20;      // How far the brace is lowered before it is trimmed
spacing = 53; // Spacing between the lines in the criss-cross pattern
line_thickness = 4; // Thickness of the lines in the pattern
leg_height = 53; // Height of the legs
leg_width = 6; // Width of the legs
leg_length = 10; // Length of the legs
//width = width - line_thickness; // correction
side_height = thickness; // Height of the side strip
side_width = width; // Width of the side strip
side_length = 10; // Length of the side strip

module leg() {
    // Create a leg
    cube([leg_width, leg_length, leg_height]);
}

module shave_cube() {
    // Cutting volume: everything above the top surface of the base
    translate([0, -epsilon, thickness]) {
        cube([side_width, side_length + 2*epsilon, leg_height]);
    }
}

module leg_brace() {
    // Create a leg brace: a tilted, lowered copy of the leg,
    // with the part that pokes above the base shaved off
    difference() {
        translate([0, 0, -brace_z]) {
            rotate(a=[0, brace_angle, 0]) {
                cube([leg_width, leg_length, leg_height]);
            }
        }
        shave_cube();
    }
}

module side_leg() {
    // Create a solid strip along one side of the base
    cube([side_width, side_length, side_height]);
}

module criss_cross_pattern() {
    for (i = [0 : spacing : length]) {
        // Horizontal lines
        translate([0, i, 0]) {
            cube([width, line_thickness, thickness]);
        }
    }
    for (i = [0 : spacing : width]) {
        // Vertical lines
        translate([i, 0, 0]) {
            cube([line_thickness, length, thickness]);
        }
    }
}

// Create the criss-cross box
criss_cross_pattern();

// Add legs at two corners
translate([0, 0, -(leg_height - thickness)]) {
    leg(); // Leg at the near corner (y = 0)
}
translate([0, length - leg_length, -(leg_height - thickness)]) {
    leg(); // Leg at the far corner (y = length - leg_length)
}

// Add stronger sides
translate([0, 0, 0]) {
    side_leg(); // Solid strip at the near side
}
translate([0, length - side_length, 0]) {
    side_leg(); // Solid strip at the far side
}

// Add a brace to each leg
leg_brace();
translate([0, length - side_length, 0]) {
    leg_brace();
}
//shave_cube();
Conclusion
I got not-so-great results in my attempt to use the chatgpt o4 generative AI offered by DuckDuckGo. The basic stuff, yes: it got me started and taught me how to write good modular OpenSCAD code. Anything remotely complex, forget about it. You want to treat AI like an assistant, right? But this assistant has near zero understanding of what I want and did not learn even after multiple attempts within the same chat session. It should be put out to pasture…
However, I am always willing to take the fall. I was just going by the seat of my pants with regard to prompt engineering. Maybe if I had chosen better prompts, or given the AI freer rein to do the whole design, I would have experienced better results. But shouldn’t my “assistant” be better at understanding me?
We wished to run a pipeline every five minutes, but when you do the math, this results in its running more than 1000 times per week, which according to the documentation is forbidden. On the other hand, we are using private agents, our own, so why should Microsoft put limits on how often we run jobs on them?
The details
Given that there are 10080 minutes in a week, to stay under 1000 pipeline runs per week you’d need to run no more often than about once every 10.08 minutes. I settled on once every 12 minutes, which divides evenly into the hour. So then I would create a second pipeline running the same code, but running it at the in-between times, to end up with a logical job which runs every six minutes. But is this approach really required for our private agents?
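The arithmetic is quick to check. This is just a sketch of the math above; the function name is mine and the 1000-per-week figure is the documented limit as I described it.

```python
MINUTES_PER_WEEK = 7 * 24 * 60   # 10080
WEEKLY_RUN_LIMIT = 1000          # documented limit, as quoted above


def runs_per_week(interval_minutes: int) -> int:
    """Number of scheduled runs in one week at a fixed interval."""
    return MINUTES_PER_WEEK // interval_minutes


if __name__ == "__main__":
    for interval in (5, 6, 10, 12):
        runs = runs_per_week(interval)
        verdict = "ok" if runs < WEEKLY_RUN_LIMIT else "over the limit"
        print(f"every {interval:2d} min: {runs:4d} runs per week, {verdict}")
```

Every 10 minutes just misses (1008 runs), which is why 12 minutes (840 runs) is the first comfortable schedule, and why two interleaved 12-minute pipelines give an effective 6-minute cadence.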
We decided to put this to a real test. I created a Hello World yaml file and ran it every minute. The results are not at all what we expected!
The results
Essentially, the job runs 10 times out of every 15 minutes. This is another published limit. And you see this effect right away. So this is like some kind of burst rate limiting you might say, and it applies.
And during those times when it’s not being run, you don’t see it paused or anything. It simply isn’t run. But you can run it by hand (I think) and it will run.
So then you think, OK, limits apply, even to private agent pools. Then we left it running, and something funny happened.
After about 640 runs in the course of 24 hours, it simply stopped. Then about three days later it started up again, ran about 637 times, then stopped again.
So there seems to be an additional unpublished limit of something like 640 runs in a 72 hour interval.
But, we were able to exceed 1000 runs in a week, for what it’s worth.
Then I let the job run a while. It seemed to want to run 635 times on Mondays, then stop for the whole rest of the week. IDK…
Alternatives
I guess we were not using pipelines for what they were intended. They’re not really to be considered cron on steroids. We’ll be looking at Azure Functions to see if it’s a better fit for our requirements.
Conclusion
Treating pipelines like “cron on steroids” is not what they were designed for. Even when you use your own agents in your pool, your Azure Pipeline job will be rate limited to about 10 runs per 15 minutes and about 640 runs per three day interval (unpublished limit), though you can exceed 1000 runs per week. These limits prevent you from executing a run every five minutes! If you need to execute a job that often, consider finding a different approach!
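To see why every five minutes doesn't fit, convert each limit into the shortest interval you could sustain under it. This is a sketch using only the two limits we actually saw bite; the 640-per-72-hours figure is my own observation, not a published number, and the names are mine.

```python
# (max runs, window in minutes) for each limit observed in the test above
OBSERVED_LIMITS = {
    "burst: 10 runs per 15 minutes": (10, 15),
    "observed: about 640 runs per 72 hours": (640, 72 * 60),
}


def min_sustained_interval(max_runs: int, window_minutes: int) -> float:
    """Shortest fixed interval (minutes) that stays within a limit."""
    return window_minutes / max_runs


def fits(interval_minutes: float) -> dict:
    """Which limits a fixed schedule of this interval stays within."""
    return {
        name: interval_minutes >= min_sustained_interval(runs, window)
        for name, (runs, window) in OBSERVED_LIMITS.items()
    }


if __name__ == "__main__":
    for name, (runs, window) in OBSERVED_LIMITS.items():
        print(f"{name}: at most one run per "
              f"{min_sustained_interval(runs, window):.2f} minutes")
    print("every 5 minutes:", fits(5))
```

The burst limit alone would allow a run every 1.5 minutes, but the longer window works out to one run per 6.75 minutes on average, so a five-minute schedule runs out of budget partway through.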
My wife asked my assistance to find the source of the daily alarm which was nagging her at 6:20 AM every morning. I don’t use an iPhone so I was pretty clueless myself.
The details
Of course she had done the obvious things like look at the clock for set alarms. And at installed apps for alarms. Nothing.
Yet every day – unless the iPhone was turned completely off – this alarm would go off at 6:20 AM. And her Apple iWatch, or whatever it’s called, also had some message about this alarm.
We searched all installed apps for “alarm” and “clock” but there was nothing left to look at. Maybe one of her health apps? Nope, doesn’t seem to be. Maybe the Army Knife app with all its little useful gadgets? Nope, no alarm clock there.
The breakthrough
Then I got an idea. Since the wake-up screen mentioned something about sleep, I decided to search the phone for sleep. And voila, there is a sleep app, or at least sleep settings. And it was set to end her sleep at 6:20 AM.
So you see the misdirection at work? We kept thinking in terms of clock and alarm. But Apple just thinks of it as sleep and calls it as such.
Case: closed
Conclusion
Two people were frustrated for days trying to find the source of an iPhone alarm, which eventually was found. Beware that there is a sleep app. We followed the leads on the Internet about turning off certain notifications, which led nowhere.
In full disclosure, this case was not one I contributed to in any way, unlike all the others I’ve reported on. Nevertheless, a source who did work on this case told me sufficient details, and it is an interesting case.
The setup
For this case to make any sense, you need to understand the background. If I got it right, some people were trying to restore a backup version of Windows 11 Professional. When they did this restore, they found that the PC was not picking up an IP address via DHCP if it was on a company network. If they did the restore while on a home office network it went OK.
So imagine the complexity this presents in a modern IT environment. You have the PC vendor, HP; the OS vendor, Microsoft; the DHCP service operator, in-house; the LAN service provider; and the network gear vendor, Cisco. The fault could lie anywhere. They all initially claim their stuff is working fine (which is always the default statement) and tell you to look elsewhere.
So what I like to say is that any hypothesis is unlikely, yet one of them will prove to be correct, eventually.
More details
Packet traces showed the DHCP Discover request being sent by the PC, but not arriving at the DHCP server. Ah, you say, simple: the switch is guilty here of dropping the DHCP Discover packet, fix it. After all, “eating” DHCP packets is something switches do all the time when DHCP snooping is misconfigured.
Yet the LAN service provider says the switch isn’t misconfigured. So they have to open a case with the switch vendor to understand the drop. I’m not sure where that support case went, meanwhile…
The in-house expert troubleshooters were able to take a second trace from a PC which did pick up an IP address after a restore. This restore feature of course used to work when it was initially released.
I still use my home-grown slideshow software based on Raspberry Pi, which is quite a testament to its robustness as it has been running with only minor modifications for many years now. One recent improvement has been my addition of support for photos from recent iPhones, which save photos in the new-to-me HEIC format. My original implementation only handled JPEG and PNG file types, so it was skipping all our recent iPhone photos.
I figured there just had to be a converter out there which would even work on the RPi, and of course there was: heif-convert. But it has an oddity when it comes to rotation. It converts the HEIC to a JPEG, fine, and it rotates the image, but it also leaves all the EXIF metadata, including the orientation metadata, as is. This in turn means display software such as fbi may try to rotate the picture a second time. Or at least that’s what happened with my software, where one of my steps is an explicit rotate. That step was creating a double rotation.
So I needed a tiny program which left all the EXIF meta data alone except the rotation, which it sets to 0, i.e., do not rotate. Seeing nothing out there, I developed my own.
The details
Here is that script, which I call 0orientation.py:
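What follows is not the original 0orientation.py, only a minimal sketch of the same idea using nothing but the Python standard library. It assumes the Orientation tag (0x0112) sits in the first EXIF directory and is stored inline as a SHORT, which is the usual layout. The function name reset_orientation is made up for this example. It writes 1 by default, which is the EXIF value meaning "normal, no rotation"; pass a different new_value if your viewer expects something else.

```python
import struct

ORIENTATION_TAG = 0x0112  # EXIF Orientation


def reset_orientation(jpeg: bytes, new_value: int = 1) -> bytes:
    """Return a copy of the JPEG with only the EXIF Orientation value changed."""
    data = bytearray(jpeg)
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    pos = 2
    # Walk the metadata segments that precede the image data.
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # start of scan: metadata is over
            break
        seg_len = struct.unpack(">H", data[pos + 2:pos + 4])[0]
        if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            tiff = pos + 10  # TIFF header; offsets are relative to here
            endian = "<" if data[tiff:tiff + 2] == b"II" else ">"
            ifd0 = tiff + struct.unpack(endian + "I", data[tiff + 4:tiff + 8])[0]
            count = struct.unpack(endian + "H", data[ifd0:ifd0 + 2])[0]
            for i in range(count):
                entry = ifd0 + 2 + 12 * i  # each directory entry is 12 bytes
                tag = struct.unpack(endian + "H", data[entry:entry + 2])[0]
                if tag == ORIENTATION_TAG:
                    # Overwrite the 2-byte value in place; nothing else moves.
                    data[entry + 8:entry + 10] = struct.pack(endian + "H", new_value)
                    return bytes(data)
        pos += 2 + seg_len
    return bytes(data)  # no Orientation tag found: return unchanged
```

Usage would be along the lines of reading the converted file as bytes, passing it through reset_orientation, and writing the result back. Because the value is patched in place, the file length and every other EXIF field stay exactly as heif-convert left them.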
While my laptop was being shipped to me I wanted to be as productive as possible using my Samsung Galaxy A35. I was vaguely aware of the availability of Microsoft 365 apps such as Outlook. How far could I take this…?
The recipe
To cut to the chase, I was maybe 60 – 70 % effective. I used equipment found in the typical IT person’s home plus one inexpensive purchase from Walmart.
Here is what I used:
HDMI monitor
old Amazon firestick
cheap bluetooth keyboard purchased from Walmart
phone stand
And here’s what I really wished I had but did not:
bluetooth mouse
Which apps worked well:
Outlook
Teams, especially chat, less so the meetings function
OneNote
Edge
VPN client
I must say the bluetooth keyboard worked really well for doing some serious typing up of emails.
How the external monitor worked
So I “came up” (in quotes because I’m sure many others figured out this same thing) with the idea of casting my phone screen onto an external monitor by way of the screen mirroring capability available on even the oldest Amazon Firestick. On the phone you simply go to Smart View Mirror Screen.
So that prevented me from having to hold the phone at least while I was drafting emails.
But, and it’s a big one, the external monitor was not a TV, so the sound from meetings was killed by this setup. And I did not see a way to keep audio local to the phone while only casting the screen.
A smaller problem is that the refresh lag is quite noticeable under conditions of rapid screen refresh. So it may take a second or two to show what the phone’s screen shows.
Still, it’s pretty cool.
I would have bought a bluetooth mouse but it simply wasn’t available at my local Walmart. I was pretty inconvenienced without it having to constantly touch the phone screen for various things.
And the external keyboard
Pretty well. Even some shortcuts worked. Alt-TAB, which I use a lot to switch between apps, has some kind of vaguely similar effect on the phone, but not to the point where I could rely on it. The unlock shortcut button sort of woke up the phone screen at least.
TAB helped me to pop from one field in the form to the next the way I would use it on a PC.
Overall responsiveness was satisfactory.
The small form factor was not a detriment, and maybe even an advantage since it’s so light and portable.
What if you have an HP G5 docking station lying around?
Well I do. It has a USB-C cord which you normally plug into your HP laptop. But I didn’t have the power supply for it so I couldn’t use it when I would have needed it. Well, it basically works with a Samsung phone – at least the keyboard and mouse worked. In my 10 second testing the attached HDMI display did not automatically show anything. Maybe there are some phone settings which would need to be changed. I didn’t mess with it at all.
But it’s cool seeing a mouse working. It suddenly paints a mouse pointer on your phone screen which you can move around and click to launch an app.
Apps are often baby implementations
At first I struggled with the Outlook app, trying to use it as though it were my full-blown Outlook client on my PC. It only had one week’s worth of messages, which was pretty limiting since I was out for more than a week. Then I had a lightbulb moment and remembered that the Web version of Outlook worked on my phone. So I switched to using Outlook through the Edge browser, which was much better for me. That’s https://outlook.office.com/ . I could get full history and therefore do more reliable searching through messages.
Responsive Design work-around
Sometimes the mobile version of a web site looks nice but just doesn’t have the features. Edge has a feature you can choose called View Desktop Site which gives you the “real” web site. Now it may look tiny, forcing you to expand and shrink with two fingers. But at least it will generally work.
Where is Notepad or Notepad++
I didn’t look for an app. I suppose there is one. Sometimes you just want to inspect your clipboard. I settled on pasting into a new draft Outlook email to do my visual inspection of my clipboard.
References and related
I prepared the above solution with one day’s notice. If you had a couple days you might check out the Samsung Dex. I guess it would work for modern Samsung Galaxy phones though I haven’t tried it myself.
The web version of business Outlook, which is a pretty good implementation of the full-blown client, is https://outlook.office.com/
A colleague of mine in another timezone created the necessary DKIM records in Cloudflare for a new mail domain. There was panic as the mail team realized too late these records were not validating. I was called in to help. Unfortunately at the beginning I only had my smartphone to work with. Did you ever try to do this kind of detail work with a smartphone? Don’t.
The details
The smartphone thing is worthy of a separate post. I was getting somewhere, but it is like working with both hands tied behind your back.
So the mail team is telling me the DKIM record doesn’t validate and showing me a screenshot of something from mxtoolbox to prove it.
I of course want to know the details so I can verify my mistakes before anyone else gets to – that’s how I roll!
Well, mxtoolbox has a free validator for these DKIM records which is pretty useful. Go to Supertool, then click the dropdown and select DKIM. A DKIM record involves a domain and a selector. Here’s a real live example for Hurricane Electric, which uses he.net as their sending mail domain. So in their DNS the DKIM TXT record for them looks like this when viewed from dig:
This is the name of this record: henet-20240223-153551._domainkey.he.net
To validate this DKIM record in mxtoolbox we pull out the token in front of _domainkey and refer to it as the selector, and drop the _domainkey and enter it like this:
The problem with the DKIM entry I was assigned to rescue was that the DKIM syntax check was not passing. Yet it looked just like what the mail team had requested. What is going on? How can this problem be broken down into smaller steps?
To be continued…
Appendix A
How did I know the exact selector for Hurricane Electric?
I looked at the SMTP headers of an email I received from them. I found this section:
d must stand for domain and s for selector. This is all considered public information, albeit somewhat obscure. So the domain is he.net and the selector is henet-20240223-153551.
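The relationship between the header tags and the DNS name is mechanical, so here is a small sketch of it. The helper names are mine, and the sample header is an abbreviated illustration: only the d= and s= values come from the real Hurricane Electric example, the other tags are placeholders.

```python
def parse_dkim_tags(header_value: str) -> dict:
    """Split a DKIM-Signature header value into its tag=value pairs."""
    tags = {}
    for part in header_value.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            # Headers are often folded across lines; drop that whitespace.
            tags[name.strip()] = "".join(value.split())
    return tags


def dkim_record_name(selector: str, domain: str) -> str:
    """DNS name where the public key TXT record for this selector lives."""
    return f"{selector}._domainkey.{domain}"


def split_record_name(record_name: str) -> tuple:
    """Inverse of dkim_record_name: recover (selector, domain)."""
    selector, _, domain = record_name.partition("._domainkey.")
    return selector, domain


if __name__ == "__main__":
    # Illustrative header value; only d= and s= are taken from the real mail.
    sample = "v=1; a=rsa-sha256; d=he.net;\r\n s=henet-20240223-153551; h=from:to:subject"
    tags = parse_dkim_tags(sample)
    print(dkim_record_name(tags["s"], tags["d"]))
```

The printed name is the one you would hand to dig as a TXT query, and the selector and domain are the two pieces a DKIM validator asks for.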
This case was solved today. Now I just need to find the time to write it up!
I belong to a team which runs many dozens of DNS servers. We have basic but thorough monitoring of these servers using both Zabbix and ThousandEyes. One day I noticed a lot of timeout alerts, so I began to look into it. One mystery just led to another without bringing us any closer to a true root cause. There were many dead ends in the hunt. Finally our vendor came through and discovered something…
The details
The upshot is these settings we arrived at for an ISC BIND server:
This is in the options section of the named.conf file. That’s it! This is on a four-core server with 16 GB RAM. The default values are:
tcp-listen-queue: 10
tcp-clients: 150
tcp-idle-timeout: 30 seconds
Those defaults will kill you on any reasonably busy server, meaning, one which gets a couple thousand requests per second.
To be continued…
Conclusion
We encountered a tough situation on our ISC BIND DNS servers. TCP queries, and only TCP queries, were responded to slowly at best or not at all. After many false starts we found the solution was setting three TCP parameters in the options section of the configuration file: tcp-listen-queue, tcp-clients and tcp-idle-timeout. We’ve never had to mess with those parameters in literally decades of running ISC BIND. Yet we have incontrovertible proof that that is what was needed.