Categories
Admin Uncategorized Web Site Technologies

The IT Detective Agency: Internet Explorer cannot display https web page

Intro
It’s a weird thing when a site that’s always worked for you suddenly stops working. Such was the behaviour observed today by a friend of mine. He could no longer access an old Oracle Enterprise Manger web site, and just this one web site. All other web sites were fine. What’s up, he asked?

The details
Well we tried this and that, reloading pages, re-starting the PC, and tests to make sure the DNS resolution was occurring correctly. It was. We logged on to his PC as another user to try the access. This tests registry settings specific to his userid. I thought that would work, but it did not. I tried the web site from my PC – worked great! The people around him also worked great. Give up? Never. For him the error page popped up quickly, by the way. He didn’t have Firefox, but I was tempted to have him install it and try that. I was pretty sure it would have worked.

He did have putty. We used his putty to telnet access the server on the same port as the https listener – we could connect, though of course we couldn’t really do anything beyond that. So there was no firewall-type issue.

We tried other https sites from his browser – no problems with those.

I was hemming and hawing and muttering something about publisher certificate revocation when that prompted someone to recall that a related Microsoft setting adjustment had come out just last week. It requires that web sites have certificate lengths of at least 1024 bits. For discussion see this article. Could that be related to this problem he asks me? Could it? Could it? I quickly checked the key bit length of the server certificate on the OEM server. Yup. 512 bits. Then I checked the key length of another OEM server that he could still access. Yup. 1024 bits. It was a newer installation so that actually does make sense. This popular article about ciphers also mentions how to use openssl to find the key length (openssl s_client …).

Why could I and others do the access? Simple. We hadn’t (yet) received the patch. When we do, we won’t be able to access it either.

So whose fault is it, anyways? I kind of agree with Microsoft on this one. If you’re still running web sites with 512 bit-length keys, it’s time to change your certificates to something longer and more secure. After all on the Internet we’re required to have 2048 bit-length keys for almost two years now.

Problem is, it’s not so obvious how to change this key in OEM. It may be buried in a java keystore.

Case: almost closed!

Conclusion
With a little help from my friends I solved the case of the browser with the message Internet Explorer cannot display the webpage. Like all such problems it was quite puzzling for awhile, but once understood all the symptoms made sense and could be explained rationally.

Categories
Admin Consumer Tech

Useful Links – Technical Resources

Intro
From time-to-time I come across sites that really help out, and ones that true, might be found quickly in a Google search, but some offer capabilities you may not have even realized existed.

The links
So without further ado, here they are. They need some organizing yet!

Is this site accessible from china through the Great Firewall of China?

https://www.chinafirewalltest.com/?siteurl=drjohnstechtalk.com

– Geographic location finder for any IP address. There are zillions of them out there. Some are overly aggressive with advertizing so be wary. Here’s a really good one that shows your IP and perceived location without the ads: https://www.vpnmentor.com/tools/ipinfo/

although I normally use https://ipinfo.io/ .

Formula plotter (math) – very nice

https://www.desmos.com/calculator

Naming Convention

ISO 2-letter country codes

Everyone is pretty familiar with these two-letter codes. Theer is also a 3-letter convention as well. But what is much less well known is the UN way to designate almost every single location on the planet with a 3-letter code:

UNECE 3-letter place/city/town codes

Click on a country and it will show you all the 3-letter codes.

This is not the same as the IATA airport codes! It is more extensive.

Networking
Check your speed by downloading one of the large files from ftp://speedtest.tele2.net/ such as ftp://speedtest.tele2.net/50MB.zip .

Reputation links
– Reputation information for any mail server. Just plug in the IP. Will also show what blacklists the server is on and the reputation of others servers on that segment. I think it’s run by Cisco. http://www.senderbase.org/
– Reputation of any URL, sponsored by Bluecoat, makers of the free edition K9 Webfilter. http://sitereview.bluecoat.com/sitereview.jsp

Webmaster links
archive.org will show what your site looked like in the past, sometimes even years in the past.
Now retired: – www.alexa.com, to see how unpopular your web site is. Retired by Amazon May 2022! Not sure if there is an equivalent…
– Say you see a photo on a web site and you want to know if it was used elsewhere, or if the site cropped it

– Open source information and knowledgeable discussion. http://www.softpanorama.org/

-Is this web site(s) up?

https://httpstatus.io/

Generating a custom favicon

https://favicon.io/ is an honest broker. you can generate a favicon from letters (e.g., CPP). There are lots and lots of options.

Security

-0 Patch (I learned of this 12/2023) may be an option to get continued securtity coverage when your Windows 10 support runs out in October 2025. https://0patch.com/
-This is a very good article discussing the Windows 10 to Windows 11 upgrade options for those whose hardware doesn’t support Windows 11: https://www.msn.com/en-us/news/technology/when-windows-10-support-runs-out-you-have-5-options-but-only-2-are-worth-considering/ar-AA1lFA82?cvid=4d6aafa9d73448dbc024424b335d1451&ocid=winp2fptaskbarent&ei=15

-Good discussion of ongoing security matters, though too heavily weighted on Social networking. Articles daily with interesting reader comments. http://nakedsecurity.sophos.com/
– Web site malware checker. http://sitecheck.sucuri.net/scanner/. I’ve used it once. I guess it’s OK, but probably not nearly as thorough as commercial products. I guess it doesn’t look for vulnerabilities, just malware.
urlquery.net, a service for analyzing and detecting web-based malware

IT
– most under-appreciated blog by an IT professional, networking concentration. http://drjohnstechtalk.com/blog/ Hey, I can toot my own horn!
– Screen-sharing application join.me
– Ookla’s Down Detector gives the status of hundreds of major web sites. https://downdetector.com/

Home User
– Best broadband speed test. https://fast.com/ It’s simple, but doesn’t shower you with aggressive ads.

– speedtest also looks good. https://speedtest.net/

Another better alternative to speedtest.net (again no aggressive ads) is https://speedtest.expereo.com/

I think nperf.com may be the best, however. https://nperf.com/

Prevent your neighbor’s WiFi SSIDs from being offered to you: https://www.howtogeek.com/331816/how-to-block-your-neighbors-wi-fi-network-from-appearing-on-windows/

– Permanent HTTP site – will never be https: http://neverssl.com/. It can be hard to find a site that doesn’t run SSL these days so this can be useful, e.g., when you want to sign on to guest Wifi such as hotels offer.

Find an Android phone: http://google.com/android/find

Affect of rising water levels on coastal areas: https://coastal.climatecentral.org/

Programmers

Here’s a really good introduction to REST API by Microsoft: https://docs.microsoft.com/en-us/azure/architecture/best-practices/api-design

Computer and Linux Hobbyist
– Cheapest low-power Linux computer. http://www.raspberrypi.org/. At about $35 this looks really interesting for the hobbyist.

Towards a Responsible and Ethical Internet
– a search engine that doesn’t track! I just heard about this on Marketplace. Unfortunately it’s not as sophisticated as Google with type-aheads. I checked a few searches that should bring up some of my blog posts and – they did – so it must have some reach into obscure web sites duckduckgo.com

And another I’ve just learned about: https://www.startpage.com/

Security

Check out how many times yuor email address and other info about you have been leaked on the Internet. It’s enlightening and also depressing: https://haveibeenhacked.com/

A victim of Identity Theft?

Go to this checklist maintained by the Federal Trade Commission: https://IdentityTheft.gov/

Categories
Admin Internet Mail Linux Network Technologies

The IT Detective Agency: mail server went down with an old-school problem

Intro
I got a TXT from my monitoring system last night. I ignored it because I knew that someone was working on the firewall at that time. I’ve learned enough about human nature to know that it is easy to ignore the first alert. So I’ve actually programmed HP SiteScope alerts to send additional ones out after four hours of continuous errors. When I got the second one at 9 PM, I sprang into action knowing this was no false alarm!

The details
Thanks to a bank of still-green monitors I could pretty quickly rule out what wasn’t the matter. Other equipment on that subnet was fine, so the firewall/switch/router was not the issue. Then what the heck was it? And how badly was it impacting mail delivery?

This particular server has two network interfaces. Though the one interface was clearly unresponsive to SMTP, PING or any other protocol, I hadn’t yet investigated the other interface, which was more Internet-facing. I managed to find another Linux server on the outer network and tried to ping the outer interface. Yup. That worked. I tried a login. It took a whole long time to get through the ssh login, but then I got on and the server looked quite normal. I did a quick ifconfig – the inner interface listed up, had the right IP, looked completely normal. I tried some PINGs from it to its gateway and other devices on the inner network. Nothing doing. No PINGs were returned.

I happened to have access to the switch. I thought maybe someone had pulled out its cable. So I even checked the switch port. It showed connected and 1000 mbits, exactly like the other interface. So it was just too improbable that someone pulled out the cable and happened to plug another cable from another server into that same switch port. Not impossible, just highly improbable.

Then I did what all sysadmins do when encountering a funny error – I checked the messages file in /var/log/messages. At first I didn’t notice anything amiss, but upon closer inspection there was one line that was out-of-place from the usual:

Nov  8 16:49:42 drjmailgw kernel: [3018172.820223] do_IRQ: 1.221 No irq handler for vector (irq -1)

Buried amidst the usual biddings of cron was a kernel message with an IRQ complaint. What the? I haven’t worried about IRQ since loading Slackware from diskettes onto my PC in 1994! Could it be? I have multiple ways to test when the interface died – SiteScope monitoring, even the mail log itself (surely its log would look very different pre- and post-problem.) Yup.

That mysterious irq error coincides with when communication through that interface stopped working. Oh, for the record it’s SLES 11 SP1 running on HP server-class hardware.

What about my mail delivery? In a panic, realizing that sendmail would be happy as a clam through such an error, I shut down its service. I was afraid email could be piling up on this server, for hours, and I pride myself in delivering a faultless mail service that delivers in seconds, so that would be a big blow. With sendmail shut down I knew the backup server would handle all the mail seamlessly.

This morning, in the comfort of my office I pursued the answer to that question What was happening to my mail stream during this time? I knew outbound was not an issue (actually the act of writing this down makes me want to confirm that! I don’t like to have falsehoods in writing. Correct, I’ve now checked it and outbound was working.) But it was inbound that really worried me. Sendmail was listening on that interface after all, so I didn’t think of anything obvious that would have stopped inbound from being readily accepted then subsequently sat on.

But such was not the case! True, the sendmail listener was available and listening on that external interface, but, I dropped a hint above. Remember that my ssh login took a long while? That is classic behaviour when a server can’t communicate with its nameservers. It tries to do a reverse lookup on the ssh client’s source IP address. It tried the first nameserver, but it couldn’t communicate with it because it was on the internal network! Then it tried its next nameserver – also a no go for the same reason. I’ve seen the problem so often I wasn’t even worried when the login took a long time – a minute or so. I knew to wait it out and that I was getting in.

But in sendmail I had figured that certain communications should never take a long time. So a long time ago I had lowered some of the default timeouts. My mc file in the upstream server contains these lines:

...
dnl Do not use RFC1413 identd. p 762  It requires another whole in the F/W
define(`confTO_IDENT',`0s')dnl
dnl Set more reasonable timeouts for SMTP commands'
define(`confTO_INITIAL',`1m')dnl
define(`confTO_COMMAND',`5m')dnl
define(`confTO_HELO',`1m')dnl
define(`confTO_MAIL',`1m')dnl
define(`confTO_RCPT',`1m')dnl
define(`confTO_RSET',`1m')dnl
define(`confTO_MISC',`1m')dnl
define(`confTO_QUIT',`1m')dnl
...

I now think that in particular the HELO timeout (TO_HELO) of 60 seconds saved me! The upstream server reported in its mail log:

Timeout waiting for input from drjmailgw during client greeting

So it waited a minute, as drjmailgw tried to do a reverse lookup on its IP, unsuccessfully, before proceeding with the response to HELO, then went on to the secondary server as per the MX record in the mailertable. Whew!

More on that IRQ error
Let’s go back to that IRQ error. I got schooled by someone who knows these things better than I. He says the Intel chipset was limited insofar as there weren’t enough IRQs for all the devices people wanted to use. So engineers devised a way to share IRQs amongst multiple devices. Sort of like virtual IPs on one physical network interface. Anyways, on this server he suspects that something is wrong with the multipath driver which is loaded for the fiber channel host adapter card. In fact he noticed that the network interface flaked out several times previous to this error. Only it came back after some seconds. This is the server where we had a very high CPU when the SAN was being heavily used. The SAN vendor checked things out on their end and, of course, found nothing wrong with the SAN equipment. We actually switched from SAN to tmpfs after that. But we didn’t unload the multipath driver. Perhaps now we will.

Feb 22 Update
We haven’t seen the problem in over three weeks now. See my comments on what actions we took.

Conclusion
Persistence, patience and praeternatural practicality paid off in this perplexing puzzle!

Categories
Admin

Nmap: Swiss Army Knife of network utilities

Intro
I just wanted to put in a plug for nmap. It’s a very useful tool for any network specialist. I show a use case that came up today.

The details
While cleaning up DNS entries I came across a network segment that didn’t seem to have any active network devices, at least not after I cleaned up the old DNS entries for inactive devices.

So I wanted to see if I could tell the networking tech that this subnet is unused and could be allocated for some other purpose.

I remembered using nmap years ago, and that it was a powerful tool for this kind of thing. What I had in mind was to ping every IP on this segment to see if there were any undocumented hosts.

As it turns out I didn’t even have it installed, but it was very easy to get:

On SLES:
$ zypper install nmap

On CentOS:
$ yum install nmap

It doesn’t get easier than that!

A quick review of the man page showed that what I wanted was indeed possible. Here’s the syntax for a systematic PING sweep through a subnet:

$ nmap −sP 10.101.192.0/24

Starting Nmap 4.75 ( http://nmap.org ) at 2012-11-08 10:21 EST
Host 10.101.192.5 appears to be up.
Host 10.101.192.10 appears to be up.
Host 10.101.192.151 appears to be up.
Host 10.101.192.152 appears to be up.
Host 10.101.192.153 appears to be up.
Nmap done: 256 IP addresses (5 hosts up) scanned in 1.28 seconds

Now I know that subnet has rogue or at least undocumented hosts and is not unused!

The original usage for nmap, at least for me, was to fingerprint an unknown host:

$ nmap −A −T4 ossim.drj.com

Interesting ports on ossim.drj.com (10.22.235.19):
Not shown: 996 filtered ports
PORT    STATE  SERVICE  VERSION
22/tcp  open   ssh       (protocol 2.0)
80/tcp  open   http     Apache httpd
|_ HTML title: 302 Found
443/tcp open   ssl/http Apache httpd
|_ HTML title: Site doesn't have a title.
514/tcp closed shell
1 service unrecognized despite returning data. If you know the service/version, please submit the following fingerprint at http://www.insecure.org/cgi-bin/servicefp-submit.cgi :
SF-Port22-TCP:V=4.75%I=7%D=4/25%Time=51793070%P=x86_64-suse-linux-gnu%r(NU
SF:LL,29,"SSH-2\.0-OpenSSH_5\.5p1\x20Debian-6\+squeeze2\r\n");
Device type: WAP|general purpose|PBX
Running (JUST GUESSING) : Linux 2.6.X|2.4.X (93%), Vodavi embedded (85%)
Aggressive OS guesses: OpenWrt 7.09 (Linux 2.6.22) (93%), OpenWrt 0.9 - 7.09 (Linux 2.4.30 - 2.4.34) (92%), Linux 2.6.20.6 (89%), Linux 2.6.21 (Slackware 12.0) (88%), OpenWrt 7.09 (Linux 2.6.17 - 2.6.21) (88%), Linux 2.6.19 - 2.6.21 (88%), Linux 2.6.22 (Fedora 7) (88%), Vodavi XTS-IP PBX (85%), Linux 2.6.22 (85%)
No exact OS matches for host (test conditions non-ideal).
 
TRACEROUTE (using port 21/tcp)
HOP RTT    ADDRESS
1   0.87   10.202...
2   0.38   ...
3   0.57   ...
4   6.10   ...
5   114.64 ...
6   119.79 ...
7   103.43 ossim.drj.com (10.22.235.19)
 
OS and Service detection performed. Please report any incorrect results at http://nmap.org/submit/ .
Nmap done: 1 IP address (1 host up) scanned in 35.02 seconds

Now that was kind of an unusual example in which nmap wasn’t too sure about the OS. Usually you get a positive ID of some sort. That’s a chatty server and I’m still not sure what it is.

Nmap can be used for nasty things and in an impolite way, network-wise. So be careful to tone it down. Target your hosts and protocols with care. It can guess what OS a host is running, what ports are open, all kinds of amazing stuff.

I checked PING and did not see a built-in capability to do a PING sweep, though it would have been easy enough to script it. That was my backup option.

Once I had to check on a single UDP port being open on port 80 for a webcast client called Kontiki (they call this protocol KDP). No other ports were open, necessitating the -PN switch.

Single UDP port check
$ nmap −PN −sU -p 80 29.239.11.4

Starting Nmap 4.75 ( http://nmap.org ) at 2013-07-23 13:59 EDT
Interesting ports on 29.239.11.4:
PORT   STATE         SERVICE
80/udp open|filtered http
 
Nmap done: 1 IP address (1 host up) scanned in 2.15 seconds

Three TCP ports checked
$ nmap ‐PN ‐sS ‐p 445,28080,28443 12.92.96.37

Results of that scan

Starting Nmap 5.51 ( http://nmap.org ) at 2017-04-13 09:18 EDT
Nmap scan report for 12.92.96.37
Host is up.
PORT      STATE    SERVICE
445/tcp   filtered microsoft-ds
28080/tcp filtered unknown
28443/tcp filtered unknown
 
Nmap done: 1 IP address (1 host up) scanned in 3.06 seconds

filtered” means there were no reply packets to my SYN packets, usually a sign of an intervening firewall dropping packets. I’m not sure why it describes the host as “up” when actually it is down or behind a firewall. A state of closed indicates that a RST packet was received in reply, indicating that the port is closed on the host itself and it wasn’t a firewall that prevented the test from succeeding. the third possible state is open, which of couse means that it replied with a SYN-ACK to that probe on that port.

To fix the source port add a -g to the above command. E.g., some firewalls have trouble with permitting inbound UDP packets from port 53 so to test for that you throw in a -g 53 and try some random high destination port.

I needed to spoof another host’s IP address and send a simple PING (ICMP request) to diagnose what was going wrong with the reply. Here’s how I did that:

$ nmap −PE −e eth0 −S 10.42.48.1 10.1.145.10

But then I realized what I really needed to do to emulate the problem is to send a single TCP SYN packet to port 8081, without the accompanying ICMP probes that nmap is wont to throw in there first. Here’s how I built up that probe:

$ nmap −PN −sS −p 8081 −−max-retries 0 −e eth0 −S 10.42.48.1 10.1.145.10

Check if a web server is running
$ nmap −PN −p T:80,443 drjohnstechtalk.com
This will check both ports 80 and 443. It doesn’t execute any HTTP protocol. It’s just a quick and dirty test.

don’t have nmap but have something like netcat instead? A good tcp port check with netcat is
netcat -vzw5 <host> <port>. Here’s an actual example.

$ netcat ‐vzw5 drjohnstechtalk.com 443

DNS mismatch
drjohnstechtalk.com [50.17.188.196] 443 (https) open

Conclusion
Nmap is a great network tool that every IT network tech should be familiar with.

References and related
A more capable and complicated packet generation tool is scapy. I describe it in this blog post.

A simpler network for Windows (simpler than nmap for Windows) is PortQry. It was created by Microsoft.

Categories
Admin Network Technologies

Firewall is a significant drag on download speeds

Intro
This post might be a restatement of the obvious to some, but I thought it was noteworthy enough to measure and mention this affect. I was twiddling my thumbs during a long sftp upload when I began to notice these transfers I was doing went really quickly between some servers, and not so much between others. How to control for all variables except the ones I wanted to vary? How to measure things in such a way that an overworked network technician with vested interests in saying the status quo is “good enough” will listen to you? These are things I wrestled with.

The details

To be continued…

Categories
Admin Network Technologies

DIY Home Power Monitoring Solution – Perfect for Sandy

Intro
My recent experience losing power thanks to Sandy has gotten me thinking. How can I know when my power’s on? Or what if it gets shut off again, which by the way actually happened to me? I realized that I have all the pieces in place already and merely need to take advantage of the infrastructure already out there.

The details
I started with:

  • a working smartphone
  • an enterprise monitoring system
  • a SoHo router in my home

And that’s pretty much it!

The smartphone doesn’t really matter, as long as it receives emails. I guess a plain old cellphone would work just as well – the messages can be sent as text messages.

For enterprise monitoring I like HP SiteScope because it’s more economical than hard-core systems. I wrote a little about it in a previous post. Nagios is also commonly used and it’s open source, meaning free. Avoid Zabbix at all costs. Editor’s note: OK. I’ve changed my mind about Zabbix some seven years later. It’s still confusing as heck, but now I’m using it and I have to admit it is powerful. See this write-up.

A good SoHo router is Juniper SSG5. It extends the enterprise LAN into the home. You can carve up a Juniper router like that and provide a Home network, Work network, Home wireless and Work wireless. It’s great!

The last requirement is the key. The SoHo router at my house is always on, and so the enterprise LAN is always available, as long as I have power. Get it? I defined a simple PING monitor in SiteScope to ping my SoHo router’s WAN interface, which has a static private IP address on the enterprise LAN. If I can’t ping it, I’ve lost power, and use my monitoring system to send an email alert to my smartphone. When, or in the case of Sandy’s interminable outage, if, power ever comes back on, I send another alert letting me know that as well. If you’re not using SiteScope make sure you send several PINGs. A PING can be lost here or there for various reasons. siteScope sends out five PINGs at a regular interval, as a guideline.

Alternative
Of course if I had a business-class DSL or cable modem service with a dedicated IP I could have just PINGed that, but I don’t. With regaulr consumer grade service your IP can and will change from time to time, and using a dynamic DNS protocol (like dyndns) to mask that problem is a bit tricky.

Yes, if you call the power company they offer to call you back when your power is restored, but I like my monitor better. It tells me when things go off as well as on.

Conclusion
This is my necessity-is-the-mother-of-invention moment. Thank you, Sandy!

Categories
Admin DNS Internet Mail

The IT Detective Agency: can’t get email from one sender

Intro
For this article to make any sense whatsoever you have to understand that I enforce SPF in my mail system, which I described in SPF – not all it’s cracked up to be.

The details
Well, some domain admins boldly eliminated their SOFTFAIL conditions – but didn’t quite manage to pull it off correctly! Today I ran into this example. A sender from the domain pclnet.net sent me email from IP 64.8.71.112 which I didn’t get – my SPF protection rejected it. The sender got an error:

550 IP Authorization check failed - psmtp

Let’s look at his SPF record with this DNS query:

$ dig txt pclnet.net

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.10.rc1.el6_3.2 <<>> txt pclnet.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42145
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;pclnet.net.                    IN      TXT
 
;; ANSWER SECTION:
pclnet.net.             300     IN      TXT     "v=spf1 mx ip4:24.214.64.230 ip4:24.214.64.231 ip4:64.8.71.110 ipv4:64.8.71.111 ipv4:64.8.71.112 -all"

That IP, 64.8.71.112 is right there at the end. So what’s the deal?

Well, Google/Postini was called in for help. They apparently still have people who are on the ball because they noticed something funny about this SPF record, namely, that it isn’t correct. Notice that the first few IPs are prefixed with an ip4? Well the last IPs are prefixed with an ipv4! They are not both valid. In fact the ipv4 is not valid syntax and so those IPs are not considered by programs which evaluate SPF records, hence the rejection!

My recourse in this case was to remove SPF enforcement on an exception basis for this one domain.

Case closed!

Conclusion
It’s now a few months after my original post about SPF. I’m sticking with it and hope to increase its adoption more broadly. It has worked well, and the exceptions, such as today’s, have been few and far between. It’s a good tool in the fight against spam.

Categories
Admin DNS

The IT Detective Agency: is the guy rebooting our Linux server from France?

Intro
Sometimes you get a case that turns out to be silly and puts a smile on your face. This one is hardly worth the effort to write it up, but as I want to portray the life of an IT professional in all its glory and mundacity it’s worth sharing this anecdote.

The details
Yesterday I was migrating some SLES 11 VMs from an old VMWare server to a new platform. A shutdown was required, which I did. I noticed these particular systems had been barely used, hadn’t been booted in over a half year (and hence hadn’t been patched in that long). I guess my enterprise monitoring system actually works because not many minutes after the migration I get a call from the sysadmin, Fautt. He noticed the reboot, but he also noticed something else – a strange IP address associated with the reboot.

He says, over the phone, that the IP is 2.6.32.54 plus some other numbers. Of course when you’re listening over the phone it’s hard to both concentrate on the numbers and understand the concept at the same time. So he threw a giant red herring my way and said that that IP is blah-blah-.wanadoo.fr. Wanadoo.fr caught my ear. I recognize that as an ISP in France, with, like most ISPs, a somewhat dodgy reputation (in my opinion) for sending spam. So you see he led me down this path and I took the bait. I said I would look into it. The obvious underlying yet unspoken concern is this: if someone from France is rebooting my servers we have been completely compromised. So this was potentially very serious. Yet I knew I ahd been the one to reboot so I wasn’t actually panicked.

I ran the commands that I knew he must have run to see what he had seen. They look like this:

> last

reboot   system boot  2.6.32.54-0.3-de Thu Oct 11 14:03          (18:49)
drjohns  pts/0        sys1234.drjohns. Thu Oct 11 13:43 - down   (00:06)
root     pts/0        fauttsys.drjohns Wed Oct 10 11:16 - 11:32  (00:16)
...
root     :0           console          Wed Feb 15 15:13 - 15:55  (00:41)
root     :0                            Wed Feb 15 15:13 - 15:13  (00:00)
root     tty7         :0               Wed Feb 15 15:13 - 15:54  (00:41)
reboot   system boot  2.6.32.54-0.3-de Wed Feb 15 15:07         (33+19:43)
fauttdr  pts/1        fauttsys.drjohns Wed Feb 15 14:38 - down   (00:21)

Actually he told me he had just done a last reboot, but this presentation above makes the illusion more dramatic!

So the last entries contain the IP of the system where the admin logged in from, or domain names if the IP could be reverse looked up in DNS.

But, much like a magician who draws your attention to where he wants it to go, it is all a clever illusion. If you run

> dig -x 2.6.32.54

as Fautt surely must have, you get

; <<>> DiG 9.6-ESV-R5-P1 <<>> -x 2.6.32.54
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7990
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
 
;; QUESTION SECTION:
;54.32.6.2.in-addr.arpa.                IN      PTR
 
;; ANSWER SECTION:
54.32.6.2.in-addr.arpa. 42307   IN      PTR     ABordeaux-652-1-113-54.w2-6.abo.wanadoo.fr.
 
;; Query time: 3 msec
;; SERVER: 10.201.145.20#53(10.114.206.104)
;; WHEN: Fri Oct 12 09:02:21 2012
;; MSG SIZE  rcvd: 96

But I figured that if this was a hack, there might be some mention in Google concerning this notorious IP. I searched for 2.6.32.54. The first few matches talked about a linux kernel.

That was the duh moment! A quick

> uname -a

on the server to confirm:

Linux drjohns28 2.6.32.54-0.3-default #1 SMP 2012-01-27 17:38:56 +0100 x86_64 x86_64 x86_64 GNU/Linux

So clearly the reboot line in the last output was using a different convention than the other entries and was providing the kernel version!

I told Fautt and he was duly embarrassed but thought it was funny (there aren’t many good jokes in a sysadmin’s routine, apparently :)).

Case closed.

Conclusion
A good night’s rest is important. When you’re tired you’re much more prone to superficial thinking that allows you to be drawn into someone else’s completely wrong picture of events.

Categories
Linux Network Technologies SLES TCP/IP

Ethernet Bridging on the cheap. Fail. Then Success with OLTV

Intro
Some experiments just don’t work out. I became curious about a technology that has various names: ethernet bridging, wide-area VLANs, OTV, L2TP, etc. It looked like it could be done on the cheap, but that didn’t pan out for me. But later on we got hold of high-end gear that implements OTV and began to get it to work.

The details
What this is is the ability to extend a subnet to a remote location. How cool is that? This can be very useful for various reasons. A disaster recovery center, for instance, which uses the same IP addressing. A strategic decision to move some, but not all equipment on a particular LAN to another location, or just for the fun of it.

As with anything truly useful there is an open source implementation(s). I found openvpn, but decided against it because it had an overall client/server description and so didn’t seem quite what I had in mind. Openvpn does have a page about creating an ethernet bridging setup which is quite helpful, but when you install the product it is all about the client/server paradigm, which is really not what I had in mind for my application.

Then I learned about Astaro RED at the Amazon Cloud conference I attended. That’s RED as in Remote Ethernet Device. That sounded pretty good, but it didn’t seem quite what we were after. It must have looked good to Sophos as well because as I was studying it, Sophos bought them! Asataro RED is more for extending an ethernet to remote branch offices.

More promising for cheapo experimentation, or so I thought at the time, is etherip.

Very long story short, I never got that to work out in my environment, which was SLES VM servers.

What seems to be the most promising solution, and the most expensive, is overlay transport virtualization (OLTV or simply OTV), offered by Cisco in their Nexus switches. I’ll amend this post when I get a chance to see if it worked or not!

December Update
OTV is beginning to work. It’s really cool seeing it for the first time. For instance, I have a server in South Carolina on an OTV subnet, IP 10.94.45.2. Its default gateway is in New Jersey! Its gateway is in the ARP table, as it has to be, but merely to PING the gateway produces this unusual time lag:

> ping 10.94.45.1

PING 10.94.45.1 (10.194.54.33) 56(84) bytes of data.
64 bytes from 10.94.45.1: icmp_seq=1 ttl=255 time=29.0 ms
64 bytes from 10.94.45.1: icmp_seq=2 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=3 ttl=255 time=29.6 ms
64 bytes from 10.94.45.1: icmp_seq=4 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=5 ttl=255 time=29.4 ms

See those response times? Huge. I ping the same gateway from a different LAN but same server room in New Jersey and get this more typical result:

# ping 10.94.45.1

Type escape sequence to abort.
Sending 5, 64-byte ICMP Echos to 10.94.45.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/0/1 ms
Number of duplicate packets received = 0

But we quickly stumbled upon a gotcha. Large packets were killing us. The thing is that it’s one thing to run OTV over dark fiber, which we know another customer is doing without issues; but to run it in an MPLS network is something else.

Before making any adjustment on our servers we found behaviour like the following:
– initial ssh to linux server works OK; but session soon freezes after a directory listing or executing other commands
– pings with the -s parameter set to anything greater than 1430 bytes failed – they didn’t get returned

So this issue is very closely related to a problem we observed on a regular segment where getvpn had just been implemented. That problem, which manifested itself as occasional IE errors, is described in some detail here.

Currently we don’t see our carrier being able to accommodate larger packets so we began to see what we could alter on our servers. On Checkpoint IPSO you can lower the MTU as follows:

> dbset interface:eth1c0:ipmtu 1430

The change happens immediately. But that’s not a good idea and we eventually abandoned that approach.

On SLES Linux I did it like this:

> ifconfig eth1 mtu 1430

In this platform, too, the change takes place right away.

By the we experimented and found that the largest MTU value we could use was 1430. At this point I’m not sure how to make this change permanent, but a little research should show how to do it.

After changing this setting, our ssh sessions worked great, though now we can’t send pings larger than 1402 bytes.

The latest problem is that on our OTV segment we can ping only one device but not the other.

August 2013 update
Well, we are resourceful people so yes we got it running. Once the dust settled OTV worked pretty well, with certain concessions. We had to be able to control the MTU on at least one side of the connection, which, fortunately we always could. Load balancers, proxy servers, Linux servers, we ended up jiggering all of them to lower their MTU to 1420. For firewall management we ended up lowering the MTU on the centralized management station.

Firewalls needed further voodoo. After pushing policy clamping needs to be turned back on and acceleration off like this (for Checkpoint firewalls):

$ fw ctl set int fw_clamp_tcp_mss 1
$ fwaccel off

Conclusion
Having preserved IPs during a server move can be a great benefit and OTV permits it. But you’d better have a talented staff to overcome the hurdles that will accompany this advanced technology.

Categories
Admin Linux ntp SLES

The IT Detective Agency: ntp server shows the wrong time after patching

Intro
One of my ntp servers hadn’t been patched in awhile so it was time. Other systems rely on it for time synchronization. The next morning after the patching I noticed that the ntp service wasn’t even running. I started it and went about my business. Checking back some minutes later, it had died again. What happened, and how to get it fixed? Read on to see how we diagnosed and solved this puzzler.

The details
I like to use ntpq -p to query my ntp server – it’s easy to type! So when I started it up the results looked like this:

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.          10 l   59   64   17    0.000    0.000   0.001
 drjegw.drjn.com 192.5.41.209     2 u   56   64   17    0.335  3605497  26.136
 drjegw2.drjn.co 192.5.41.41      2 u   56   64   17   19.241  3605534  39.621
 drjeuro.drjn.eu 128.252.19.1     2 u   60   64   17  105.970  3605532  38.946

That’s some offset, eh? 3.605 x 10^6 msec, or, when you think about it, just over an hour. And yet the local clock had no offset. Strange.

Date
I like to do a crude check of system time by running the date command – quickly – on two different systems. Lacking some sleep, I noticed eventually but not right away, that my ntpd server had a date that was retarded by almost exactly an hour. I didn’t notice it at first because I had trained myself to only look at the seconds, which were “only” off by five seconds.

I checked to see if the timezone or localization settings had been changed by the patching – they hadn’t. So I went ahead and advanced the system clock by an hour. Actually the yast GUI of SLES gave me the option to sync against a time server, so I chose my closest one and did that after I had stopped ntpd.

Next problem, please
That got the time in the ballpark. But ntpd still wasn’t behaving. It exhibited a strange behaviour I’ve never seen before – its offset kept increasing. I observed this behaviour over the course of several minutes:

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    3   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u  129  128  377    0.350  146.846  81.771
+drjegw2.drjn.co 192.5.41.41      2 u    1  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   72  128  377  104.931  161.696  79.561

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    5   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u    2  128  377    1.803  182.380  97.636
+drjegw2.drjn.co 192.5.41.41      2 u    3  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   74  128  377  104.931  161.696  79.561

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l   28   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u   89  128  377    1.803  182.380  97.636
+drjegw2.drjn.co 192.5.41.41      2 u   90  128  377   20.211  183.047  97.286
+drjeuro.drjn.eu 128.252.19.1     2 u   32  128  377  104.667  197.864  96.296

> ntpq -p

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 LOCAL(0)        .LOCL.          10 l    5   64  377    0.000    0.000   0.001
*drjegw.drjn.com 192.5.41.209     2 u    4  128  377    1.813  218.325 113.345
+drjegw2.drjn.co 128.118.25.5     2 u    5  128  377   19.667  219.077 113.157
+drjeuro.drjn.eu 128.252.19.1     2 u   75  128  377  104.667  197.864  96.296

Look at that offset column. See? It keeps going up, at about a rate of 40 msec every two minutes. It ain’t supposed to do that!

So a Unix pal of mine said he had encountered an issue in ntp and had commented out that local clock. I honestly had absolutely no idea what that LOCAL line did, but it had never hurt before.

The local clock comes from these lines in ntp.conf:

server 127.127.1.0              # local clock (LCL)
fudge  127.127.1.0 stratum 10   # LCL is unsynchronized

So I took those out, stopped ntpd with a sudo service ntp stop, synced the time with a sudo sntp -P no -r drjegw.drjn.com, and restarted ntpd. It didn’t work immediately, but it became apparent eventually that it was working.

Meantime I discovered the ntpdc command, which is kind of informative in this situation:

> ntpdc
ntpdc> loopinfo

offset:               0.097373 s
frequency:            -132.558 ppm
poll adjust:          12
watchdog timer:       841 s

This tells me the offset if 97 msec (already too large in my experience) and that for some reason the system clock hadn’t been adjusted in 841 s, and that the clock drift rate was -132 ppm – much, much higher than any other system

Then in a few minutes it clicked and got the offset in order:

ntpdc> loopinfo

offset:               0.000000 s
frequency:            -132.558 ppm
poll adjust:          4
watchdog timer:       11 s

So removing the local system clock seemed to be working. But what was the real cause of all this? I discussed it with an admin. Bear in mind that this is physical server. He said the system clock gets its time from a hardware clock which should be visible in the ILO. We checked it. Sure enough, there it was, reporting in the ILO – still, after we had fixed the problem at the OS level – as one hour retarded. There was no way to manually adjust it. The only option was to set up sntp servers, which we did, which forced the ILO to restart.

We logged back in to the ILO and voila, the time was right!

I now realize that in the OS the LOCAL Time was using that hardware clock, which must have drifted by an hour since the system was installed.

Before the patch the system was incrementally keeping up with the drift, making the necessary incremental changes periodically. But the discrepancy was too large for it after rebooting after the patching. In the /var/log/ntp I even see a line:

14 Sep 07:20:53 ntpd[10259]: time correction of 3605 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.

Conclusion
Now the system is better and we have:

ntpdc> loopinfo

offset:               0.057029 s
frequency:            -5.465 ppm
poll adjust:          4
watchdog timer:       24 s

That’s better, but the offset of 57 msec is still far larger than normal. But it’s useable for now.