Admin Network Technologies

The IT Detective Agency: virus updates are failing

This case hardly qualifies as worthy subject matter for the IT Detective Agency – it’s pretty run-of-the-mill stuff. But I wanted to document it for completeness and show how a problem in one thing can turn out to have an unexpected cause (At least to me. In hindsight it’s dead obvious what the issue was likely to have been).

The Situation
We have lots of servers at drjohns. So when one of our admins, Shake, said that one of them, nfuz01, can’t reach the Etrust serverto get its virus updates I had no recollection of what that server is or does. Shake asked if the firewall had changed recently. That’s sort of a tricky question because there are always minor changes being done. Most have absolutely no effect because they are additional rules providing new permissions. So I bravely answered No, it hadn’t. And I wondered what he meant in using the word “reach” anyways.

So I walk up to Shake’s desk to get a better idea. He said not only are updates not virus signature updates not occurring, but neither server can PING the other, neither by name nor by IP address. Now we’re getting somewhere. I still haven’t registered where nfuz01 is, but I know the firewall as I’ve set it up permits ICMP traffic to transit. I suggested that maybe nfuz01 had some missing or messed-up routes. Then I went back to my desk to think some more. That’s what gets me motivated – when I’ve publicly speculated about the root cause of something. It’s not so much that I may be proven wrong, but if I am wrong, I want to be the first to find out and issue a correction.

So I tried a PING from my desktop:

Pinging with 32 bytes of data:
Reply from TTL expired in transit.
Reply from TTL expired in transit.

I look up where nfuz01 is. It is in a secondary data center. I ping it from a server in that same data center, but one a different segment – it works fine! I ping it from a Linux server in my main data center – totally different results:

> ping
PING ( 56(84) bytes of data.
From icmp_seq=1 Time to live exceeded
From icmp_seq=2 Time to live exceeded
--- ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
> traceroute -n
traceroute to (, 30 hops max, 40 byte packets
 1  1.100 ms  1.523 ms  2.010 ms
 2  0.934 ms  0.941 ms  0.773 ms
 3  0.869 ms  0.926 ms  0.913 ms
 4  1.076 ms  1.096 ms  1.130 ms
 5  1.043 ms  1.029 ms  1.018 ms
 6  0.993 ms  0.611 ms  0.918 ms
 7  0.932 ms  0.916 ms  1.002 ms
 8  0.987 ms  1.089 ms  1.121 ms
 9  1.152 ms  1.246 ms  1.229 ms
10  2.040 ms  2.747 ms  2.735 ms
11  1.332 ms  1.418 ms  1.467 ms
12  1.477 ms  1.754 ms  1.685 ms
13  1.993 ms  1.978 ms  2.013 ms
14  1.930 ms  1.960 ms  2.039 ms
15  2.065 ms  2.156 ms  2.140 ms
16  2.116 ms  5.454 ms  5.453 ms
17  4.466 ms  4.385 ms  4.296 ms
18  4.266 ms  4.267 ms  4.260 ms
19  4.232 ms  4.216 ms  4.216 ms
20  4.182 ms  4.063 ms  4.009 ms
21  3.994 ms  3.987 ms  2.398 ms
22  2.400 ms  2.484 ms  2.690 ms
23  2.346 ms  2.449 ms  2.544 ms
24  2.534 ms  2.607 ms  2.610 ms
25  2.602 ms  2.742 ms  2.736 ms
26  2.776 ms  2.856 ms  2.848 ms
27  2.648 ms  3.185 ms  3.291 ms
28  3.236 ms  3.223 ms  3.235 ms
29  3.219 ms  3.277 ms  3.377 ms
30  3.363 ms  3.381 ms  3.449 ms

Cool, right? We’ve caught a network loop in the act. Now I know it isn’t the firewall, it isn’t the routes on nfuz01 but it is something with networking. So I sent that off to them….

In less than an hour I got the explanation as well as the fix:

All should be reachable again. There’s a loop I can’t clear amongst some [telecom-owned] routers in the main data center. I’ve superseded it with two /27s until they clear it.

And it pings fine now:

> ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=125 time=46.2 ms
64 bytes from icmp_seq=2 ttl=125 time=40.1 ms
64 bytes from icmp_seq=3 ttl=125 time=23.1 ms
--- ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 23.114/36.502/46.275/9.796 ms

And Shake says the updates came in.

Case closed!

Why wasn’t the problem more obvious to us from the very beginning? Well, if the admin who said nfuz01 couldn’t reach Etrust had tried to log in nfuz01 through the normal Remote Desktop mechanism – and of course failed – then we might have drilled down into a networking cause more quickly. But nfuz01 is a VM and he must have been logged on via VMWare Virtual Center and so he hadn’t noticed that basically the server couldn’t reach anywhere in our main data center. It is also an obscure server (remember that I had no recall about it?) so no one really noticed that it was effectivley out-of-business.

Admin IT Operational Excellence Proxy

The IT Detective Agency: the case of the Sales and Use tax software

I have to give credit to my colleague “Ben” for cracking this case, which left me scratching my head. Users at drjohns were getting new Windows 7 PCs and some of the old software wasn’t going to run on those new PCs, including our indirect tax sales and use software from Thomson Reuters. The new approach is SaaS – software as a service. The new package was approved and everyone thought it was going to work fine, until late in the game it was actually tested. They couldn’t bring up their old tax returns. So at the last hour they bring in the Internet experts.

The Details
At drjohns our users are insulated from the Internet by proxy servers. There are no direct routes. It’s private address space and an explicit proxy connection to browse out to Internet. 99% of the time this works fine. And it sure is a secure way to go. But those exceptions can be quite a headache. This case is a very typical presentation of what we see, though the particular solution varies case by case.

We get detailed network requirements. They usually talk about opening up the firewall to certain servers, etc. We always patiently explain that the firewall is open – to the proxy! The desktops have no Internet routes, nor can they resolve Internet domain names. That’s right we have private root DNS servers. Most vendors have never encountered this setup and so they dig in their heels and insist that the only way is to ‘open the firewall…”

This case was no different, except we didn’t actually talk to the vendor. But their requirements were crystal clear in this networking document. Here’s the snippet that would seem to be fatal given our Intranet architecture:

The ONESOURCE Sales & Use Application Servers use TCP/IP communications from the client
PC to the Server. The requirements for communications with the ONESOURCE Application
Servers are itemized below:
- DNS Name Resolution is not used for the Application Servers.
- Proxy Server access to the Application Servers is supported ONLY in transparent mode. The
Proxy Server must not translate the TCP/IP address of the Application Servers. PCs must be
able to establish a connection using the actual TCP/IP address and port numbers of the
Application Server without application “awareness” of a Proxy Server.
- Network Address Translation (NAT) is supported for the client addresses but is NOT
supported for the Application Server addresses.
- Connections are outbound only from the client to the server.
- Security policies, firewall rules, proxy rules and router packet filters must allow outbound
connections (and inbound replies) on destination port 2429 to the Class “B” network address when using the non-WCF application servers. If the client’s account has been
configured to use Windows Communications Foundation or WCF, there are no additional port
requirements. The source port selection uses standard port numbers 1024 and above.

The application installs about 10 ActiveX controls and it wouldn’t run on my desktop. Ben managed to get it to run using the OpenText socks client. It has an option to “socksify everything else” which he says proves to be very useful when you don’t know what specific application to socksify. So now let me repeat what I have just said: Ben got it to work without any changes to the firewall, ignoring all the vendor’s advice and requirements!

I was very pleased as this was getting to be a high-priority issue what with these sales and use taxes due each month.

But Ben didn’t stop there. He came up with even better solution. He said he was looking around at the folder where all the stuff is installed by the application. He noticed a file called ConfigProxy. He configured it to use the system proxy settings. Then he exempted the target site from proxy authentication. Lo and behold that worked as well, with no socksification required at all. We only socksify an app as a last resort.

This latest finding completely contradicts the vendor’s stated network requirements. But it’s better this way.

We now have a happy tax department. Case closed.

Vendor network requirements are not always what they seem. Clearly they are not testing in the more obscure environments such as a private Intranet with an independent namespace that connects to the wider Internet only via explicit proxy. If you’re in this situation, which offers some serious security advantages, there are things you can do to get demanding applications to work.

Admin IT Operational Excellence Network Technologies

The IT Detective Agency: the case of the Adobe form network issue

Sometimes IT is called in to fix things we know little or nothing about. We may fix it, still not know anything, except what we did to fix it, and move on. Such is the case here with a mysterious Adobe Form that wasn’t working when I was called in to a meeting to discuss it.

The Case
One of our developers created a simple Adobe Acrobat form document. It has some logic to ask for a username and password and then verify them against an LDAP server using a network connection. Or at least that was the theory. It worked fine and then we got new PCs running Windows 7 and it stopped working. They asked me for help.

I asked to get a copy of the form because I like to test on my desktop where I am free to try dumb things and not waste others’ time. Initially I thought I also had the error. They showed me how to turn on Javascript debugging in edit|preferences. The debug window popped up with something like this:

debug 5 : function setConstants
debug 5 : data.gURL =
debug 5 : function today_ymd
debug 5 : mrp::initialize: version 0.0001 debug level = 5
debug 5 : Login clicked
debug 5 : calling LDAPQ
debug 5 : in LDAPQ
NotAllowedError: Security settings prevent access to this property or method.

But this wasn’t the real problem. In this form you get a yellow bar at the top along with this message and you give approval to the form to access what it needs. Then you run it again.

For me, then, it worked. I knew this because it auto-populated some user information fields after taking a few seconds.

So i worked with a couple people for whom it wasn’t working. One had Automatically detect proxy settings checked. Unfortunately the new PCs came this way and it’s really not what we want. We prefer to provide a specific PAC file. With the auto-detect unchecked it worked for this guy.

The next guy said he already had that unchecked. I looked at his settings and confirmed that was the case. However, in his case he mentioned that Firefox is his default browser. He decided to change it back to Internet Explorer. Then he tested and lo and behold. It began to work for him as well!

When it wasn’t working he was seeing an error:

NetworkError: A network connection could not be created.

Later he realized that in Firefox he also was using auto-detect for the proxy settings. When he switched that to Use System Settings all was OK and he could have FF as default browser and get this form to work.

This is speculation on my part. I guess that our new version of Acrobat Reader X, v 10.1.1, is not competent in interpreting the auto-detect proxy setting, and that it is also tripped up by the proxy settings in Firefox.

There’s a lot more I’d like to understand about what was going on here, but sometimes speed counts. The next problem is already calling my name…

Admin Internet Mail Linux SLES

Building sendmail on SLES

My sendmail binary built for SLES 10 SP 3 was not working well at all on SLES 11 SP1. It became apparent that libraries were not compatible so it was time to re-compile. I’ve documented that journey here. There were a few pitfalls along the way so I felt it was worth a blog post should anyone else ever need to do this.

February 2013 update
And now I’ve repeated the journey on SLES 11 SP2 – and ran into new problems! I’ll put that story in the appendix below.

Why Build?
Why build sendmail when you can find a package for it? For security it’s a good idea to run the latest version. It’s easier to defend during an audit. So when I look via zypper, I see it proposes me sendmail v 8.14.3:

# sudo zypper if sendmail
Refreshing service 'nu_novell_com'.
Loading repository data...
Reading installed packages...
Information for package sendmail:
Repository: SLES11-SP1-Pool
Name: sendmail
Version: 8.14.3-50.20.1
Arch: x86_64
Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany

I go to and find that the latest version is actually 8.14.5. And that’s fairly typical. The distributed release is over a year old.

Where this can really matter is when a vulnerability comes out. If you can roll your own you can be the first on the block with that vulnerability fixed – not putting yourself at the mercy of a vendor busy with hundreds of other distractions. And I have seen this phenomenon in action.

Distributing Your Build
I’m mixing up the order here.

Once you have your sendmail built, what’s the minimum set of things you need to put it on other servers?

For one, you gotta have a database package with sendmail. For historical reasons I use sleepycat (I think formerly known as Berkeley db). Only it was gobbled up by Oracle. I don’t have a feel for what the future holds, though I fear decline. Sleepycat provides db. I grabbed the RPM from


This package has to go on each server where you will run sendmail if you are using db for your database. First issue in trying to install this RPM:

# sudo rpm -i db-4.7.25-1rt.x86_64.rpm
        file /usr/bin/db_archive from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_checkpoint from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_deadlock from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_dump from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_hotbackup from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_load from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_printlog from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_recover from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_stat from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_upgrade from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64
        file /usr/bin/db_verify from install of db-4.7.25-1rt.x86_64 conflicts with file from package db-utils-4.5.20-95.39.x86_64

I don’t really know if any of these conflicting files are used by the system. I also don’t know how to install “all but the conflicting files” in RPM. So we’ll try our luck and simply overwrite them:

# sudo rpm -i –force db-4.7.25-1rt.x86_64.rpm

Now no errors are reported.

We gotta make another kludge. sendmail, or at least the version I compiled, needs this db library in /usr/lib64, but the RPM puts it in /usr/lib. So…

# cd /usr/lib64; sudo ln -s ../lib/

By the way, how did I decide that db has to be brought over? I fired up sendmail and got this error:

# sendmail
sendmail: error while loading shared libraries: cannot open shared object file: Error 40

Now with the sym link made I get a more pleasant application error:

# sendmail
can not chdir(/var/spool/clientmqueue/): Permission denied
Program mode requires special privileges, e.g., root or TrustedUser.

Continuing my out-of-order documentation(!), I create an init script in /etc/init.d and make sure sendmail is going to be started at boot:

# sudo chkconfig -s drjohnssendmail 35

I like to put the logs in /maillog:

# sudo rm /var/log/mail; sudo mkdir !$; sudo ln -s !$ /maillog

I like to have the logging customized a bit, so I modify syslog-ng.conf somewhat:

# DrJ attempt to define filter based on match of sm-mta
filter f_sm-mta     { match("sm-mta"); };
filter f_fs-mta     { match("fs-mta"); };


#destination mail { file("/var/log/mail"); };
#log { source(src); filter(f_mail); destination(mail); };
destination mailwarn { file("/var/log/mail/mail.warn" perm(0644)); };
log { source(src); filter(f_mailwarn); destination(mailwarn); };
destination mailerr  { file("/var/log/mail/mail.err" perm(0644) fsync(yes)); };
log { source(src); filter(f_mailerr);  destination(mailerr); };
# and also all in one file:
# and also all in one file:
destination mail { file("/var/log/mail/stat.log" perm(0644)); };
log { source(src); filter(f_sm-mta); destination(mail); };
log { source(src); filter(f_fs-mta); destination(mail); };

Followed by a

# sudo service syslog restart

Going Back to Compiling
So how did we compile sendmail in the first place, which was supposed to be the subject of this blog?

We downloaded the latest version from and unpacked it. Then we read the INSTALL file in the sendmail-8.14.5 directory for general guidance about the steps.

To make our ocmpilation configuration portable, we try to encapsulate our idiosyncracies in one file, devtools/Site/site.config.m4, which we create. Mine looks as follows:

dnl  DrJ config file for corporate mail delivery
dnl I am leaving out the ldap stuff because we stopped using it
dnl which maps we will support - NEWDB is automatic if it finds the db libs
dnl in Linux NDBM is really GDBM, which isn't supported.  NEWDB support is not automatic
define(`confMAPDEF',`-DNEWDB -DMAP_REGEX')
dnl Berkeley DB is here...
dnl this doesn't work, exactly - put the db libs directly into /usr/lib
APPENDDEF(`confLIBDIRS',`-L/usr/local/ssl/lib -L/usr/lib64')
dnl libdb-4 looks like the sleepycat library on Linux
APPENDDEF(`confINCDIRS',`-I/opt/local/include -I/usr/local/ssl/include')
dnl where to put smrsh and mail.local programs
define(`confEBINDIR', `/opt/mail/bin')
dnl where to install include files
define(`confINCLUDEDIR', `/opt/mail/include')
dnl where to install library files
define(`confLIBDIR', `/opt/mail/lib')
dnl man pages
define(`confMANROOT', `/opt/local/man/cat')
dnl unformatted man pages
define(`confMANROOTMAN', `/opt/local/man/man')
dnl the sendmail binary goes into MBINDIR
define(`confMBINDIR', `/opt/mail/bin')
dnl programs only executed by root go to sbin
define(`confSBINDIR', `/opt/mail/sbin')
dnl shared library directory
define(`confSHAREDLIBDIR', `/opt/mail/lib')
dnl user-executable pgms, newaliases, mailq, hoststat, etc
define(`confUBINDIR', `/opt/mail/bin')
dnl TLS support 
APPENDDEF(`conf_sendmail_LIBS', `-lssl -lcrypto')

I’ll explain a bit of this file since of course it is critical to what we are trying to do. I arrived at its current state from a little experimentation, so I don’t know the full explanation of some of the settings. What it’s saying is that we use the NEWDB, which uses that db we spoke of earlier for our maps. I like to install the binaries into /opt/mail/bin. We like to have the option to run TLS.

With that set we run the compile:

# cd sendmail; sh ./Build

and it spits out some errors to the screen which indicate we’re missing some SSL headers. We get them with:

# sudo zypper source-install openssl

Now with any luck it will fully compile.

At this point it helps to create some of the target directories:

# sudo mkdir -p /opt/mail/{bin,sbin} /opt/local/man/cat{1,5,8}

And we create a user account, smmsp, with uid and gid of 225, and a group with the same name.

And then we can install it:

# sudo sh ./Build install

The install should work, mostly. But makemap doesn’t get put in /opt/mail for some reason. So you have to copy it by hand from sendmail-8.14.5/obj.Linux. to /opt/mail/sbin, for instance. You really need to have a makemap.

Finally, I suggest to recursively copy the cf directory:

# cp -r sendmail-8.14.5/cf /opt/mail

This way you have a pretty relocatable set of files under /opt/mail.

Appendix: Building on SLES 11 SP2
I thought this would be a breeze. Just look at my own blog posting above! Not so fast – that approach didn’t work at all.

You have to appreciate that under SLES 11 SP1 we needed a few key packages that aren’t very common:

– libopenssl-devel-0.9.8h-30.27.11
– zlib-devel-1.2.3-106.34

We pulled them from the SDK DVD. Well, turns out there is no SDK DVD for SLES 11 SP2! Novell, probably in one of those beloved cost-saving measures, no longer releases an SDK DVD. What to do?

Well, I found what not to do. I began copying key files from my SLES 11 SP1 installation, like libcrypto.a, libssl.a, /usr/include/ssl. This all helped to reduce the number of errors. But at the end of the day there was an error I couldn’t chase away no matter what:

/usr/src/packages/BUILD/openssl-0.9.8h/crypto/comp/c_zlib.c:235: undefined reference to `inflate'

Meanwhile the resourceful sysadmin found those development packages. He told me about SUSE Studio, which allows you to build your own distribution. He looked for those development packages in the distribution, found and installed them:

$ rpm -qa|grep devel


Then my build went through fine. Whew!

I could go on and on about details in the setup, but the scope here is the compilation, and we’ve covered that pretty well.

Once you get over the pain of compilation setup, sendmail runs great and is a great MTA.

Admin IT Operational Excellence Linux Proxy Web Site Technologies

The IT Detective Agency: intermittent web page not found error

One of the high arts of IT is system integration, and an important off-shoot of this is acquisitions. We are involved in integrating a new location, which, unfortunately, we do not yet have full access to. The local networking is still provided by their vendor, not ours and this makes troubleshooting all the more difficult.

The Details
So the word begins to spread that users at this site are having intermittent problems accessing some of our secure web sites. As it was described to me, they can load the page in their browser for, say five straight times, get a simple Internet Explorer cannot display the web page error, and the sixth time (or whenever) it will load properly. All other connectivity was working. No one else at other locations was having this problem with this web site. More than strange, right?

In drjohn’s perfect IT world, problem reproducibility is critical to resolution, but we simply didn’t have it this time. I also could not produce the problem myself, which means relying on other people.

I’m not sure if we tried to contact their vendor or not at first. But if we had I’m sure they would have denied having anything to do with it.

So we got one of our confederates, Tim, over to this location and we hooked him up with Wireshark so he could get take a packet trace when the failure occurs. It wasn’t long before Tim reproduced the error and emailed us the packet capture.

In the following the PC has IP address, the web server is at The Linux command used to look at the capture file is:

# tcpdump -A -r bodega-error.cap port 443 > /tmp/dump

1 15:54:27.495952 IP > S 2803722614:2803722614(0) win 64240 <mss 1460,nop,wscale 0,nop,nop,sackOK>
2 15:54:27.496309 IP > S 3201081612:3201081612(0) ack 2803722615 win 5840 <mss 1432,nop,nop,sackOK>
3 15:54:27.496343 IP > . ack 1 win 64240
4 15:54:27.497270 IP > P 1:82(81) ack 1 win 64240
5 15:54:27.497552 IP > . ack 82 win 5840
6 15:54:30.743827 IP > P 1:286(285) ack 82 win 5840
..S.......^M..i.P.......HTTP/1.0 200 OK^M
Cache-Control: no-store^M
Pragma: no-cache^M
Cache-Control: no-cache^M
X-Bypass-Cache: Application and Content Networking System Software 5.5.17^M
Connection: Close^M
7 15:54:30.744036 IP > F 82:82(0) ack 286 win 63955
8 15:54:30.744052 IP > F 286:286(0) ack 82 win 5840
9 15:54:30.744077 IP > . ack 287 win 63955
10 15:54:30.744289 IP > . ack 83 win 5840

The output was scrubbed a bit of meaningless junk characters and I added serial packets numbers in the beginning by hand because I don’t (yet) know how to do that with tcpdump!

What, It’s Encrypted – what can you even learn from a trace?
Yeah, an SSL stream sure adds to the already steep challenges we faced in this problem. There just isn’t much to work with. But it is something. I’m about to say what I noticed in this packet trace, but for it to be meaningful you need to know like I did that the web server is situated almost four thousand miles from the user’s location.

The first packet is a SYN from the PC to web server on TCP port 443. So far so good. In fact packets one – three constitute the three-way handshake in TCP.

Although SSL is encrypted, the beginning of the protocol communication should show the SSL cipher being chosen. Unfortunately, tcpdump doesn’t seem to have the smarts to show any of this. So I got myself ssldump. On Ubuntu:

# sudo apt-get install ssldump

did the trick. Then run this same capture file through ssldump, which has very similar arguments to tcpdump:

# ssldump -r bodega-error.cap port 443

New TCP connection #1: <->
1 1  0.0013 (0.0013)  C>S SSLv2 compatible client hello
  Version 3.1
  cipher suites
  Unknown value 0xff
Unknown SSL content type 72
1    3.2480 (3.2467)  C>S  TCP FIN
1 2  3.2481 (0.0000)  S>CShort record
1    3.2481 (0.0000)  S>C  TCP FIN

The way to interpret this is that 0.0013 s into the TCP port 443 communication the cipher suites listed above were sent by the PC to the server. This corresponds to our packet number 4 in the trace file.

Using Wireshark to look at the trace is a lot more convenient – it provides packet numbers, timing, decodes packets and displays the SSL ciphers. But I wanted to show that it _could_ be done with text-based tools.

Look at the timings more closely. In the tcpdump output, packet 2, the SYN ACK, comes 1 ms after the SYN. But given the distances involved between PC and server, the SYN ACK should have come more like 100 ms later, at least. Similarly packet 5, which is an ACK, comes less than 1 ms after packet 4. A 1 ms ACK? Physically impossible.

I have seen this behaviour before – on our own load balance – which I know employs some TCP optimization tricks. So I concluded that they must have physically present at this site some kind of appliance which is doing TCP optimization. It can only provide blank ACKs in its rapid-fire responses since it can’t know what data the server is really going to respond with. That might all be OK. But I’m pretty sure the problem lies between packets 5 and 6. 5 is one of those meaningless rapid-fire empty ACKs generated by the local router. But the PC has just sent a wish list of SSL ciphers in packet 4. It needs to be responded to by the server which has to finish setting up the SSL session.

But that critical packet from the server never arrives. Perhaps even some of the SSL handshake is secretly completed between the local router and the server. Who knows? I have heard of man-in-the-middle devices that decrypt SSL sessions. And packet 6 contains fairly inappropriate content. It almost does look like it has been manufactured by a man-in-the-middle device. Its telling the browser to do a redirect to the same site, except specified by IP address rather than FQDN. And that doesn’t make a lot of sense. The browser likely realizes that this amounts to a looping redirect request so at that point it probably decides to cut its losses and FIN the connection in packet 7.

I traced my own PC hitting this same web server. Now I know we don’t have any of these optimizing devices between me and the web server. I don’t have time to show the results here, but to summarize, it looks rather completely different from the trace above. The ACK packets come back in about 100 ms or so. There is no delay of three seconds. The cipher proposals are responded to in a timely fashion. There is no redirect.

Their Side of the Story
We did get to hear back from the vendor who supports the LAN/WAN. They said they were running WCCP and diverting traffic to a proxy server. This was the correct behaviour before we hooked our infrastructure to this site, but is no longer. They realized this was probably a bad thing and took corrective action to turn off WCCP for destinations in the internal network

Shutting off WCCP, which diverted web site requests to an old proxy server, fixed the problem.

Case closed.

Unsolved Mysteries
I wish we could tie all the loose ends neatly up, but there are too many players involved. We’ll never really know why the problem was intermittent, for instance. Or why some secure web sites could be accessed without any issue whatsoever throughout this ordeal.

WCCP, Web cache Communication Protocol, is a Cisco-developed routing protocol to transparently intercept traffic destined for web servers. More information can be found on it in wikipedia.

It bothers me that after the SSL session was initiated the dump showed the source, unencrypted, of the HTML redirect packet. Why wasn’t that encrypted? Perhaps the WCCP-invoked proxy server was desperately trying to help the PC recover from an unrecoverable situation and manufactured that HTTP-EQUIV REFRESH… to try to force the PC to choose a web site that might work. The fact that it was sent unencrypted over a channel that should have been encrypted was probably even the death bell that triggered the browser to think this makes no sense at all and is even a violation of security, I’m getting out of here.

Admin Internet Mail Linux

The IT Detective Agency: generated email goofs attachments

Today was a busy day! A rather expert user asks for my advice about emails being generated from his own Unix system, by his own processes, that occasionally come out with the encoding of the attachments showing up in the body of the message.

I am not a message formatting expert, or at least I wasn’t prior to this question today. Bbut if a sendmail expert can’t provide an answer, who will? So anyways, this user, let’s call him Rob forwards me this email:

Dear DrJohn,
I was wondering if you have some idea as to why sometimes the attachment 
is getting included in the text of the email instead of being recognized as attachment, 
see example below.
Any pointers would be helpful, as this makes it at the least cumbersome 
to open the attachment by cutting , pasting the text attachment part 
as file and uudecoding it into binary before it can be opened for content view.
Mime-Version: 1.0
Content-Type: multipart/mixed;
X-Mailer: sendmsg
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Dear Team;
While executing the flow, In.process:RapidProductMovementReport_New, the following exception occurred.
java.lang.Exception: java.sql.SQLException:ORA-12899: value too large for column "B2B_RAPID"."P_MOVEMENT_DETAIL"."UNIT_OF_MEASURE_TYP" (actual: 3, maximum: 2)
Error Dump :
com.wm.lang.flow.FlowException:java.lang.Exception: java.sql.SQLException:ORA-12899: value too large for column ... (actual: 3, maximum: 2)
Pipeline values (see attachment)
Caller: Rapid_In.process:RapidRouter
Stack: %serviceStack%
Content-Type: application/gzip; name=pipeline.xml.gz
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=pipeline.xml.gz

Hmm. So what do you think? I have often seen these Mime-type headers and paid absolutely no attention to them. When things work what does it matter? But now they’re not working and it does matter.

First I tried to reproduce the problem using a technique I had gleaned looking over the shoulders of some Unix admins. I knew they had an easy method to send attachments from the command line of their Unix systems so I walked over and asked them how they did it. Sure enough, it was dead easy. It goes something like this:

# uuencode file.txt file.txt|mailx -s “here is your attachment” recipient_address

I rolled up my sleeves, set recipient_address to a valid SMTP mail address. Sure enough. I got it as an attachment in Lotus Notes. But it’s an ugly attachment and doesn’t have any of the nice MIME formatting about it. so it’s probably a bit of luck that my MUA (mail user agent) understands that I mean to create an attachment. I don’t think all MUAs will do that, unless it’s following a more obscure RFC which I’m not aware of.

The original source of the message looks like this (my attachment is called cogstartup.xml.gz):

Date: Fri, 2 Dec 2011 14:54:34 -0500
From: ...
Message-Id: ...
To: ...
Subject: test using uuencode
X-MIMETrack: Itemize by SMTP Server on ... (Release 7.0.4|March 23, 2009) at
 02/12/2011 20:54:36,
		 Serialize by Notes Client ... (Release 8.5.1|September
 28, 2009) at 12/02/2011 04:12:40 PM,
		 Serialize complete at 12/02/2011 04:12:40 PM
X-TNEFEvaluated: 1
begin 644 cogstartup.xml.gz

Very different from what we saw above, right? None of those MIME-related headers seem to be present. So I decided that uuencode is too primitive to reproduce the problem.

I was next going to try to generate an email with the help of a PERL module, like MIME::Lite. But I decided that was too much trouble as I would need to download it and install it first!

So I ventured to see if I could get lucky and figure out the problem by educating myself on the standard, without wasting too much time. The relevant RFC seems to be 1341. I prefer the older RFCs because I suppose they’re shorter – easier to understand because life was simpler in those days! Once yuo parse through the verbiage and repitition, there wasn’t much to it. In particular, it mentioned that the Header

Mime-Version: 1.0

has to be included amongst the header fields. If it is not, the correct behaviour for a MUA is to interpret all the encodings and stuff as just part of a regular body text, which it will display to the user.

I ran a test using sendmail as my sending MUA from a Linux server. With the sendmail agent you can add headers, at least that’s how I remember it:

# sudo sendmail -v recipient< tst where tst is a file I created that starts with the line Mime-Version: 1.0 Yes, indeedy. I received it and my MUA interpreted the attachment as an attachment, displaying the attachment name and the appropriate icon type for it! Now put just a blank line at the top of my tst file, pushing all the rest of the stuff down by a line, and the behaviour is completely different. Then my MUA treats everything as literal body text, just as the old RFC says it must, and it looks just the way it did when Rob forwarded it to me. Conclusion
I explained to Rob that he must sometimes be introducing an extra line above the Mime-Verison header, which would cause this problem.

He thanked me.

Case closed!

Admin IT Operational Excellence

The IT Detective Agency: SAP Afaria communication to APNS failed

IT is often called in and expected to produce results when only vague or partial information is presented. This is especially so when integrating commercial software.

The Issue
We were looking to implement SAP SUP (Sybase Unwired Platform). Part of it – for mobile device management for iOS devices – requires an Afaria server to communicate with the Apple Push Notification Service,, port 2195; and, port 2196.

Whatever software it is that’s running on Afaria, it’s not well-behaved insofar as it does not support proxy access. That means it expects to be on the Internet and be able to initiate these communication channels unencumbered.

Since I consider that to be a security risk I looked for a way to mediate this communication over a proxy server anyways, even though it wasn’t supposed to work.

Yes, we did get it to work, but not exactly in the way we expected.

We built TCP tunnels on the proxy.

To be continued…

Admin IT Operational Excellence Linux SLES

The IT Detective Agency: Cognos stopped working

Here’s another in our continuing exciting IT drama. A user reports that her Cognos app stopped working. She’s in charge of the Cognos application servers, I run the Cognos gateway on a Linux server. I have almost no working knowledge of Cognos. I learned just enough to get the gateway installed and configured on Linux, specifically SLES. Cognos is used for business intelligence reports and is now owned by IBM.

The Details
The home page came up just fine, so I knew the web server – Apache, of course – was working. I know I hadn’t changed anything on the gateway. She also says that she hadn’t changed anything on the dispatcher. So she asks me to save the config. It’s an X application. I run, which by the way is in COGNOS-INSTALL_DIR/bin64, not COGNOS-INSTALL_DIR/bin, contrary to the documentation for Linux. I cannot save the config. She asks me to export it. I can’t do that either! I get the error

CAM-CRP-1057 unable to generate the machine specific symmetric key.

She asks me to delete the keypairs. These are in the directories COGNOS-INSTALL_DIR/configuration/{signkeypair,encryptkeypair}. So I clear out those. Still I cannot save or export the configuration. I quickly switch to a Solaris server which we had hoped to retire in order to get a working gateway while we mulled the problem over.

Over the next days I checked to see if Java had changed. Getting a working JRE was a little tricky on SLES. Nothing had changed. After the system admin came back from vacation the next week I asked if by chance. The last log showed he was logged in at the time. He admits to changing one thing.

He changed the system name. This system has multiple interfaces and a unique hostname for each interface. The hosts file in /etc/hosts included entries for each of the interface IPs. Seeing there were no other changes I concluded that this little innocent act was enough to kill the communication. Note that he did not change any of the routing, however. When you’re dealing with encryption, it can be that the system name is significant. So when those keys were initially generated they were tied to that name and would only work with that original hostname. At least that is my reverse engineering of the matter. Cognos is a pretty closed system so it’s hard to pin down more precisely what is going on.

The hostname was changed back to the original name. Sure enough, now I can export the config and most importantly, save it without any errors.

Case closed!

Lessons Learned
Well, avoiding finger-pointing and quick judgements was helpful in this case. Of course I suspected she actually had done something to the dispatcher, but I behaved as though the problem might be on my side. We treated each other professionally while the system was down and we had no clue why. That was very helpful.

Admin IT Operational Excellence Linux SLES

The IT Detective Agency: the case of the messages from mars

Today we got a “funny” message on our SLES 11 server in the /var/log/warn file. You might think that Martians have landed!

The Details
Specifically this:

Nov 9 10:54:19 drjohn24 kernel: [72397.088297] martian source from, on dev eth1
Nov 9 10:54:19 drjohn24 kernel: [72397.088300] ll header: 78:e7:d1:7b:25:32:00:a0:8e:a8:8e:b3:08:00

Every time I pinged (drjohn24) from it would produce those two lines in the warn and messages file. More worrisome, I could not ssh from one host to the other. I could ssh from a host on the local network to drjohn24. We observed this behaviour even with the firewall disabled. Strange, right?

One more thing to note: drjohn24 has two network interfaces and various routes defined.

The Solution
It didn’t take too long to get to the bottom of this. We set up the routes wrong. We meant to create a default route out of eth0, which was right, and a net-10 route for eth1, which we specified incorrectly. Do

netstat -rn

to show all routes. I had this:

Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface U         0 0          0 eth1   U         0 0          0 eth0     U         0 0          0 eth0 UG        0 0          0 eth1       U         0 0          0 lo         UG        0 0          0 eth0

Do you see the error? We put the mask on the the same as we put on the interface and that’s not what we wanted.

The corrected version looks like this:

Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
...       UG        0 0          0 eth1

So what was happening is that the inbound packet from was arriving at eth0 as we intended. But SLES 11 is now clever enough to realize, based on its routing table, that that is not the expected interface where a packet with that source IP should arrive. It should have arrived at eth1 because of the default route. No other static route was more specific for due to our error. And apparently even with firewall turned off, SLES gets very defensive at this point. I’m not sure if it was sending return packets out of eth1 or not, because I kept looking for them out of eth0!

Once we corrected the routes the inbound packet arrived at eth0 and was returned with an answer packet from eth0 and the martian messages went away.

The martian message thing is a little obscure, and at the time more a distraction than anything else as we had to research what that meant. I guess for the future we’ll instantly know. It’s very similar to defining network topology on your firewalls in an anti-spoofing defense.

Case closed!


Running Multiple Web Server Instances on SLES

My initial plan was to run multiple virtualhosts under one global umbrella. But my virtualhosts have to do diverse things. One will front-end JBoss, another will front-end IBM WebSphere, etc. I thought or rather hoped that the differences could all be expressed in the separate virtualhosts sections. But I’ve been stopped dead in my tracks with JBoss when I tried that:

Syntax error on line 28 of /usr/local/apache2/conf/mod-jk.conf:
JkWorkersFile cannot occur within  section

Hmm. So JkWorkersFile has to appear in the global section. I’m not comfortable with that. I guess it’s time to consider running separate processes for my different classes of service. How best to do that?

To be continued…