The IT Detective Agency: Debugging a Thorny Citrix Connection Issue

This case begins with an observation by the application owner (AO) for Citrix XenApp. External users were being knocked out of their sessions frequently – several times a day. And it happened en masse. Before this problem users were typically logged in all day. You can see that many must have been bumped around 12:30 PM then again around 2 PM. The problems began July 5th.

The users suffering the disconnects were all external users who access the applications via a Citrix Secure Gateway. The XenApp servers being accessed are also used on the Intranet and those users were not seeing any drops.

The AO asked if I had changed anything in the network. Nope. Had he changed anything? Nope.

So now we have the classic stand-off, right? AO vs network. There’s a root cause and it’s either the AO or the network guy who’s ultimately at fault.

My attitude in these cases is the following: the network person should prove it’s an application problem and the application owner should prove it’s a network problem! It sounds cynical, but this approach aligns with the best interests of each party. Both are really working towards the same goal while preserving their own interests. E.g., the networking person thinks: if I can prove it’s an application problem then the AO will quit bothering me and I can get back to my real job. After all, I am not knowledgeable about the application. Even if it is a networking issue, I do not know where the issue is, so I need the AO to point out the problem at a detailed level, e.g., the dropped packet or whatever, so I can focus my energies. The reality in my experience is quite different, however. The AO typically does not know enough about networking to make this proof.

Nonetheless I proceeded this way, hoping to prove some kind of application problem so I could get back to my normal activities.

We enabled my own PC to use the application. This is always much easier than bothering other people. I can take traces to my heart’s content! So Monday I was connected to XenApp via the CSG. I was going along fine until 11:35 when I got the disconnected message! I later learned that the bulk of users, who are using a different app, were not disconnected then, but were at about an hour later.

Now there are lots of pieces to look at, any one of which could be at fault. Working from the PC on the Internet to the XenApp server we have: the Internet, my Internet router, firewall, load balancer, CSG server, firewall, XenApp server. That’s a lot to look after, but you have to start somewhere. I chose the load balancer. It was rather confusing, even to establish a baseline of “normal” activity. I quickly observed that every 30 seconds packets were being transmitted to the PC even when nothing was going on. Of course the communication was all encrypted so I did not even attempt to look into the packets. But sometimes I saw seven packets, sometimes six, and more rarely different numbers. The packet order didn’t even make sense. Sometimes the load balancer responded to the XenApp before the PC did! The trace of this behaviour until I was disconnected will be shown here when I get the time to include it:

The end of the trace shows a bunch of FIN packets. FIN is used to terminate a TCP connection. Now we’re getting somewhere. It looks like, from a TCP perspective, a more-or-less orderly shutdown of the connection was occurring. If confirmed, that would point to an application problem and life would be good!

The next day I logged into CSG and used a XenApp app again. This time I did an additional trace and included the CSG server itself. Again I was disconnected after a few hours. In this trace the CSG server is called webservera, the XenApp server is xenapp15. This is not a byte-level trace but rather the output of snoop on Solaris, showing only the packet metadata:

________________________________
11:29:23.81577 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:23.81577 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4842, TOS=0x0, TTL=126
11:29:23.81577 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:25.01881 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:25.01881 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4844, TOS=0x0, TTL=126
11:29:25.01881 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:27.42530 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:27.42530 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4861, TOS=0x0, TTL=126
11:29:27.42530 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:30.87645    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87645    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=847, TOS=0x0, TTL=64
11:29:30.87645    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:30.87657    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87657    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=848, TOS=0x0, TTL=64
11:29:30.87657    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:33.02325    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:33.02325    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=849, TOS=0x0, TTL=64
11:29:33.02325    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:53.34945 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:53.34945 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=40, ID=4923, TOS=0x0, TTL=126
11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0

What I saw this time is that an RST packet was being sent from the XenApp server! That’s the very last line, which I will repeat here for emphasis since it is so important to the case:

11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0
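In a longer capture the RST is easy to miss by eye. Here is a minimal Python sketch that scans snoop summary lines for TCP segments flagged Rst. The regular expression is an assumption based only on the line shape shown in this trace, and the function name is made up for illustration; other snoop output modes may look different.

```python
import re

# Assumed line shape, taken from the snoop TCP summary lines in this trace.
TCP_LINE = re.compile(
    r"^(?P<time>\S+)\s+(?P<src>\S+) -> (?P<dst>\S+)\s+TCP "
    r"D=\d+ S=\d+ (?:(?P<flag>\w+) )?Ack=\d+ Seq=\d+ Len=\d+"
)


def find_resets(lines):
    """Return (time, sender, receiver) for every TCP segment flagged Rst."""
    resets = []
    for line in lines:
        m = TCP_LINE.match(line.strip())
        if m and m.group("flag") == "Rst":
            resets.append((m.group("time"), m.group("src"), m.group("dst")))
    return resets


# Three TCP lines copied from the trace above.
SAMPLE = """\
11:29:27.42530 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
11:29:33.02325    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0
""".splitlines()

print(find_resets(SAMPLE))
```

On these three lines it reports a single reset, sent by xenapp15 to webservera at 11:29:53.34945.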

TCP RST is a way to immediately tear down a connection! It seemed as though this was being converted to a FIN by the CSG. Now it’s looking very much like for whatever reason the application decided to terminate the connection. It almost has to be an application problem, right?

Wrong! We have to keep an open mind.

This trace, while dense, hints at where the problem may lie. It is taken on the load balancer with tcpdump -i 0.0. The load balancer has two interfaces, one towards the Internet, the other towards webserverw. The hostname of the load balancer’s Internet interface is called CSG, the hostname of the Citrix client on the Internet is drjohnspc.

11:03:59.392810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 35 win 32768 (DF)
11:03:59.455730 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 35:68(33) ack 1 win 48677 (DF)
11:03:59.554810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 68 win 32768 (DF)
11:03:59.585845 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 35 win 64426 (DF)
11:03:59.585855 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768 (DF)
11:03:59.885805 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 68 win 64393 (DF)
11:04:59.465070 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 68:103(35) ack 1 win 48677 (DF)
11:04:59.465080 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:04:59.564818 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 103 win 32768 (DF)
11:05:00.664812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:02.864812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:07.064810 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:15.264811 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:31.464812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:59.807514 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 103:130(27) ack 1 win 48677 (DF)
11:05:59.807741 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: F 130:130(0) ack 1 win 48677 (DF)
11:05:59.807754 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 131 win 32768 (DF)
11:05:59.807759 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: FP 103:130(27) ack 1 win 32768 (DF)
11:06:03.664813 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: . 68:103(35) ack 1 win 32768 (DF)
11:06:12.642847 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: R 1:1(0) ack 131 win 32768 (DF)
11:06:12.642862 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: RP 103:130(27) ack 1 win 32768 (DF)

Notice the time stamps increasing by larger and larger leaps beginning with 11:05:00.664812. 11:05:02, 11:05:07, 11:05:15, 11:05:31 – the gap between them keeps doubling! This is characteristic of TCP retransmission with exponential backoff. Note that all the other information is the same. It must be retransmitting the same packet. Why? Because it never got there! That seems to be the most likely reason. Now my conviction and hope that an application problem lies at the heart of the issue is starting to crumble. See why you need to keep an open mind? Your opinion can change to the polar opposite conclusion with the input of some additional data like that. Where to turn next?

There is a firewall in between the load balancer and the Internet. Now we will focus our attention on it. Could be that it dropped that packet and all the re-transmits.
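The doubling can be checked with a few lines of Python. This is only a sketch: the timestamps are the ones for the repeated "P 68:103(35)" segment on vlan#4094 in the trace above, and the 0.25 tolerance on the ratio is an arbitrary assumption, not a TCP constant.

```python
from datetime import datetime

# Timestamps of the repeated "P 68:103(35)" segment on vlan#4094, from the trace above.
STAMPS = [
    "11:04:59.465080",
    "11:05:00.664812",
    "11:05:02.864812",
    "11:05:07.064810",
    "11:05:15.264811",
    "11:05:31.464812",
]


def gaps(stamps):
    """Seconds between consecutive timestamps, rounded to one decimal place."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [round((b - a).total_seconds(), 1) for a, b in zip(times, times[1:])]


def looks_like_backoff(intervals, tolerance=0.25):
    """True if every gap is roughly twice the previous one (tolerance is an assumption)."""
    return all(abs(b / a - 2) <= tolerance for a, b in zip(intervals, intervals[1:]))


intervals = gaps(STAMPS)
print(intervals)                      # [1.2, 2.2, 4.2, 8.2, 16.2]
print(looks_like_backoff(intervals))  # True
```

The gaps are not exact powers of two, but each one is close to double the one before it, which is the signature of a sender backing off while it waits for an acknowledgment that never comes.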

Here’s the trace of that same conversation on the firewall’s internal interface (which faces the CSG) (I(O) means inbound(outbound) with respect to that interface):

11:04:59.441022  I IP CSG.https > drjohnspc.20723: P 2210302227:2210302262(35) ack 1714160833 win 32768
11:05:00.640781  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:02.840742  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:07.040729  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:15.240780  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:31.440571  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:59.783366  I IP CSG.https > drjohnspc.20723: FP 35:62(27) ack 1 win 32768
11:06:03.640595  I IP CSG.https > drjohnspc.20723: . 0:35(35) ack 1 win 32768
^C

and the trace of the same thing on the firewall’s external interface, i.e., facing the Internet and drjohnspc:

11:03:59.269334  O IP CSG.https > drjohnspc.20723: P 2210302159:2210302194(35) ack 1714160833 win 32768
11:03:59.562011  I IP drjohnspc.20723 > CSG.https: . ack 35 win 64426
11:03:59.562139  O IP CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768
11:03:59.861970  I IP drjohnspc.20723 > CSG.https: . ack 68 win 64393

Notice what’s not present in the external interface trace – all those re-transmits, or even the original packet.

Let’s summarize so far. One of those keep-alive packets from the XenApp server reached the firewall, but didn’t exit the firewall. The only possibility is that it got dropped by the firewall!
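The comparison of the two firewall traces can also be done mechanically. The sketch below is written under one stated assumption: in each capture, tcpdump prints the first data segment of a flow with absolute sequence numbers and every later one relative to that first sequence number. That is consistent with the output above, but verify it for your own tcpdump version and options. The function names are made up for illustration.

```python
import re

# Matches only segments that carry a sequence range, e.g. "P 0:35(35)"; pure ACKs are skipped.
SEG = re.compile(
    r"^\S+\s+[IO] IP (?P<src>\S+) > (?P<dst>\S+): \S+ (?P<start>\d+):(?P<end>\d+)\(\d+\)"
)


def absolute_segments(lines):
    """Return the set of (src, dst, start, end) segments in absolute sequence numbers."""
    base = {}
    segments = set()
    for line in lines:
        m = SEG.match(line.strip())
        if not m:
            continue
        flow = (m.group("src"), m.group("dst"))
        start, end = int(m.group("start")), int(m.group("end"))
        if flow not in base:
            base[flow] = start  # assumption: first segment of a flow is printed absolute
        else:
            start, end = start + base[flow], end + base[flow]
        segments.add((flow[0], flow[1], start, end))
    return segments


def never_forwarded(inside, outside):
    """Segments seen on the inside interface that never show up on the outside one."""
    return sorted(absolute_segments(inside) - absolute_segments(outside))


INTERNAL = """\
11:04:59.441022  I IP CSG.https > drjohnspc.20723: P 2210302227:2210302262(35) ack 1714160833 win 32768
11:05:00.640781  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:02.840742  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:07.040729  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:15.240780  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:31.440571  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:59.783366  I IP CSG.https > drjohnspc.20723: FP 35:62(27) ack 1 win 32768
11:06:03.640595  I IP CSG.https > drjohnspc.20723: . 0:35(35) ack 1 win 32768
""".splitlines()

EXTERNAL = """\
11:03:59.269334  O IP CSG.https > drjohnspc.20723: P 2210302159:2210302194(35) ack 1714160833 win 32768
11:03:59.562011  I IP drjohnspc.20723 > CSG.https: . ack 35 win 64426
11:03:59.562139  O IP CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768
11:03:59.861970  I IP drjohnspc.20723 > CSG.https: . ack 68 win 64393
""".splitlines()

for segment in never_forwarded(INTERNAL, EXTERNAL):
    print(segment)
```

Under that assumption, the last segment leaving the external interface ends at sequence number 2210302227, exactly where the first stranded segment on the internal interface begins. Everything from that point on entered the firewall and never came out.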

Now that was a lot of work, but who’s going to do it if not a patient and methodical IT person?

Results
We got a networking problem on our hands after all. Good thing we persisted in this investigation even when it looked like we were off the hook! Later it was confirmed that the firewall was “aggressively aging” its connections because it had either reached or was very close to its connection limit. The firewall connection limit was raised and the Citrix connection issues went away.

Let’s go back to that simplistic question that non-experts like to ask: what had changed that caused this problem? The change was external events – increased usage of that firewall. Network bandwidth, Internet usage – they all tend to increase over time. There were no changes done by either networking or the application group to cause this issue. A seasoned IT detective uses all available clues and arrives at the right conclusion. The “what has changed” question is normally very relevant, but it can’t be the only tool in your toolbox!

Case closed!

Dr John’s Laws of IT

Laws

Here are some laws of IT based on years of observation.  Some of these have very real and practical consequences.

1. IT infrastructure decays over time if left to itself – a sort of entropy sets in. This is sort of counterintuitive insofar as people know enough about troubleshooting a problem to ask “what changed.” Sometimes the answer is nothing at all, or nothing you would ever think of. For example, I once had an application server start to fail when “nothing had changed.” The cause, found after many hairs pulled out? The log file it wrote to tried to exceed 2 GB on a 32-bit system. It couldn’t write to its log any longer and the app server just froze up.

1.1 Corollary to 1. Neglect works great in the short term, but the way to go is judicious maintenance! Neglect leads to a 2 GB log in the first place!

2. Things will always go wrong at some point.  It will usually not be for the reason you suspect.

2.1 Corollary to 2.  Effective monitoring is critical.  If you build something critical, build a means to monitor it.  Monitor foundational components as well so that when you need it, you can see what all was working when one thing went south.

3. Software support from large vendors is abysmal. Most small and mid-sized vendors are no better. The premise of almost all support I’ve encountered is The customer did something wrong. The most relevant metric is How quickly can the case be closed? If in that rare case the customer can prove fault by the vendor, Justify doing Absolutely Nothing about it for as Long as Possible. And NEVER do something immediately useful like let the customer speak to a software developer who actually knows what he/she is talking about.

3.1 Corollary to 3. An IT Professional quickly develops all the skills possessed by the front-line engineers who respond to support calls, and can solve most of the problems on his/her own, out of necessity, since the assistance given won’t take the problem further anyways.
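The 2 GB log file from Law 1 is the kind of slow decay that the monitoring of Corollary 2.1 can catch early. Here is a minimal Python sketch that walks a directory tree and reports files approaching the 2 GB mark. The 80% warning threshold and the /var/log path are assumptions chosen for illustration, not anything prescribed above.

```python
import os

LIMIT_32BIT = 2**31 - 1  # largest file offset a signed 32-bit value can hold


def files_near_limit(root, limit=LIMIT_32BIT, warn_fraction=0.8):
    """Yield (path, size) for files whose size is at least warn_fraction of limit."""
    threshold = limit * warn_fraction
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size >= threshold:
                yield path, size


if __name__ == "__main__":
    # "/var/log" is an assumed location; point this at wherever your apps write logs.
    for path, size in files_near_limit("/var/log"):
        print(f"{path}: {size} bytes")
```

Run from cron or a monitoring agent, a check like this gives you a warning before the application freezes rather than after.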

Observations

Laws are universal.  The following are key observations that are generally true.

1. The more an IT person thinks about a problem, the better the solution.  Better means cheaper, faster, more elegant, even moving the category from impossible to the possible (and actually this happens frequently believe it or not).

2. Even a brilliant IT professional won’t think up all solutions alone.  Creative problem solving occurs best when there’s a couple brilliant IT professionals bouncing ideas off each other, with a few others at the ready to contribute for specialist opinions.

3. How to estimate the amount of time for an IT project:

    2 x (estimate from experienced IT professional) + constant

If many groups and external partners are involved, the multiplier should be increased to 3 or even 4.

This sounds facetious but it is not.  It is the unfortunate truth of the nature of our work and the unpredictability of the showstopper moments which always occur.

4. All a seasoned IT person needs to decide the impossible is possible is to hear that someone else is doing it!  The creative juices start flowing at that point.  Maybe it’s a competitive thing at that point.

5. Large IT organizations contain a large number of people who actually know surprisingly little about IT.  Small IT organizations are also not immune from this.
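The rule of thumb in Observation 3 is simple enough to write down as a function. This is only a sketch: the observation does not say what the constant is, so here it is a value the caller must supply, and the function name is made up.

```python
def padded_estimate(expert_estimate, constant, multiplier=2):
    """Return multiplier x (expert's estimate) + constant, all in the same time unit.

    multiplier defaults to 2; Observation 3 suggests 3 or even 4 when many
    groups and external partners are involved. constant is not specified
    there, so the caller has to choose it.
    """
    return multiplier * expert_estimate + constant


# Hypothetical numbers: a 10-day expert estimate with a 5-day constant.
print(padded_estimate(10, constant=5))                # 25
print(padded_estimate(10, constant=5, multiplier=4))  # 45
```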

Spam and Scams – What to Expect When You Start a Blog

In my case – not much! It appears that despite providing top-notch content the only “readers” are those trying to profit from me. To use the word “scam” may be a bit strong, but any outfit that demands money upfront to supposedly help you make money is highly suspect in my playbook.

So I’ve heard from Tina. It goes like this:
Admin – I’ve checked out http://drjohnstechtalk.com/blog/2011/06/grep-is-slow-as-a-snail-in-sles-11/ and I really like your writing style like in your post Grep is Slow as a Snail in SLES 11 | Dr John’s Tech Talk. I am looking for blog authors who would like to write articles as either a full time job or part time job (for some extra money). I think your writing style would work very well. You receive pay per article, anywhere from $5 to $50 per article depending on the topic, article length, etc… If interested you can find more information at www.onlinehomewriter.net.

Please do me a favor and do not follow that link. It redirects you to secure.signup-way.com, some strange-looking URL that McAfee categorizes as Malicious Sites, High Risk. So I don’t think I’ll be going there.

Then there’s Tony:
Blog Admin – If your blog isn’t bringing in as much money as you would like it to check out my site www.QuickCashBlogging.com. We show blog owners how to maximize their blogs earnings potential. Tony

McAfee verdict: Spam site, medium risk. That’s just great.

The McAfee URL checker I use is http://www.trustedsource.org/en/feedback/url.

Clearly these people have programs trawling the Internet for new domains and new blogs, trying to squeeze some $$ from them. Unfortunately I’m not sure any person who could benefit from the information has read my blogs. So I feel I am making negative progress – instead of elevating the level of discourse on the Internet, I am helping it to be used for more spam and scams.

I just feel bad for humanity. Is this the best we can do? A well-meaning person embarks on a quixotic journey to provide better technical information on some topics, and the average response from my fellow human beings is to try to take advantage of a hopefully vulnerable and naive newbie? I am literally concerned for us as a race.

August 16th Update
The spam and scam started as a trickle. Now it’s raining spam in my inbox. I continue to be disappointed. In email the ratio of spam to “ham” may be about five to one, so not knowing any better you could expect a similar ratio with WordPress blogs. Not so! Of my fifty comments, not counting the ones from myself, the legit comments number about two-and-a-half, more like a twenty to one ratio. I will probably use a WordPress plugin to cut them off, but since I started on this public service mission, here are some more scams.

This one is spam as it was posted to my Sample page:
HTC is a well known name in the smartphone segment. The company has come up with smartphones boasting of exquisite features and HTC EVO 4G is one of the most potent…

It came from an address ending in @mail.ru. One of my many hats is spam fighter. Let me tell you, when you see a sender address at mail.ru you’re talking pure spam. The IP resolves to Latvia, however, and that fact hardly inspires confidence either.

To my post WordPress, Apache2, Permalinks and mod_rewrite under Ubuntu I got a comment The Best Way To Fix Acid Reflux. Now that’s a closely related topic!

Another one claims to help if I’m looking for information about babies (very relevant for a tech blog. yeah, right!).

Very many fall into the generic flattery category. Like this one:
Hey There. I found your blog using msn. This is a very well written article. I will be sure to bookmark it and come back to read more of your useful information. Thanks for the post. I’ll definitely return.

Or this:
I agree with your Gnu Parallel Really Helps With Zcat | Dr John's Tech Talk, great post.

I had to investigate those a little bit as I almost fell for one the first time. Then you realize that it’s so generic – except for the one where he obviously just pasted in the title of my post programmatically – that it could be used for any blog post.

Equally popular are the SEO scams – Search Engine Optimization. I think the point of those scams is to shake a little money from you for supposed help to improve your blog’s ranking in the search engines.

Returning to the flattery scams, how do I know for sure this isn’t real, genuine flattery of my wonderful posts? I’ll tell you. There are a couple unambiguous clues and another strong hint.

Let’s start with the strong hint. Since I haven’t told anyone about my blog, pretty much the only way someone’s going to find it who has legitimate interest in its content is through Google or another search engine. So, in the web server access log, where I am recording the HTTP_REFERER (what URL the browser visited just before hitting my blog post), I should expect to see one of the search engines. I should not see some random web site mentioned because there is simply no good reason for a browser controlled by a human being to go from someone else’s web site directly to my web site. And yet that is precisely what I am seeing. I would give examples but it would only serve to promote their web sites, so I will refrain from even an example.
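The referer check can be automated. Here is a minimal Python sketch that pulls the referer out of an access log line and flags it when it is neither a search engine nor the blog itself. The list of expected referer domains is an assumption for illustration, and the regular expression assumes the combined log format used in the access log excerpt in this post.

```python
import re
from urllib.parse import urlparse

# Assumes the Apache "combined" log format.
LOG = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"'
)

# Assumed list of referers a human visitor could plausibly arrive from.
EXPECTED_SUFFIXES = ("google.com", "bing.com", "yahoo.com", "drjohnstechtalk.com")


def suspicious_referer(line):
    """Return the referer if it is not an expected one, else None."""
    m = LOG.match(line)
    if not m or m.group("referer") in ("", "-"):
        return None
    host = urlparse(m.group("referer")).hostname or ""
    if any(host == s or host.endswith("." + s) for s in EXPECTED_SUFFIXES):
        return None
    return m.group("referer")


# A line from this post's access log excerpt (the referer there is already a placeholder).
LINE = ('109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] '
        '"GET /blog/2011/06/gnu-parallel-really-helps-with-zcat/ HTTP/1.0" 200 22757 '
        '"http://blahblah.net/invest/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; '
        '.NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"')

print(suspicious_referer(LINE))
```

A suspicious referer is only a hint, not proof, since referers are trivially forged or omitted.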

But even more damning is to examine how long the poster has spent on my site. A human being has to read the post, contemplate its meaning, then type in a comment to finally post, right? It could rarely be done in under a minute. WordPress tells me the IP of the poster of the comment. I take that IP and search for it in the access log using grep. I am seeing that these comments are being made in one second after the web page was first downloaded. One second. It is not humanly possible. But for a program, piece of cake.

Here’s a real example (some of this may be cut off, depending on your browser):

109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "GET /blog/2011/06/gnu-parallel-really-helps-with-zcat/ HTTP/1.0" 200 22757 "http://blahblah.net/invest/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "GET /blog/ HTTP/1.0" 301 340 "http://blahblah.net/invest/" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2"
109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "POST /blog/wp-comments-post.php HTTP/1.0" 302 902 "http://drjohnstechtalk.com" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
109.169.61.16 - - [12/Aug/2011:06:31:52 -0400] "GET /blog/ HTTP/1.0" 200 110806 "http://blahblah.net/invest/" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2"

Also this example illustrates the other damning evidence of lack of human involvement in the comment. A real browser run by a real human being has to pull in a lot of objects to display a single WordPress page. You’ve got stylesheets, external JavaScript files and even the image at the top. They should all be requested, and be recorded in the access log. But a programmatically controlled browser needs far less! It needs the HTML of the blog page, and then the page it POSTs the comment to. Perhaps a third page after the POST to show whether the POST was successful or not. And that’s exactly what I’ve seen in all the spam/scam comments I’ve checked out by hand, not just the flattery scams. They are all using the absolute minimum page accesses and that simply screams non-human access! I am, unfortunately, not really so special as they would have me believe! And the SEO scams are just annoying advertising. Most of the rest is what I’d call link laundering, where they’re using the legitimacy of my site to try to get links to their shady sites included, by trickery, carelessness or any means. And some are just using it as pure spam to my inbox since that’s where the comments go for review and they don’t even care if I approve their spam for public viewing or not.
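The "minimum page accesses" test lends itself to a script. The Python sketch below lists every IP that posted a comment without ever requesting a stylesheet, script or image. The list of asset extensions is an assumption, and the comment endpoint is the one seen in the log excerpt above.

```python
import re
from collections import defaultdict

REQUEST = re.compile(r'^(?P<ip>\S+) .*?"(?P<method>[A-Z]+) (?P<path>\S+) ')
ASSET_EXT = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")  # assumed list
COMMENT_ENDPOINT = "/blog/wp-comments-post.php"


def commenters_without_assets(lines):
    """Return sorted IPs that POSTed a comment but never fetched a static asset."""
    posted = set()
    assets = defaultdict(int)
    for line in lines:
        m = REQUEST.match(line)
        if not m:
            continue
        ip, method, path = m.group("ip", "method", "path")
        path = path.split("?", 1)[0].lower()  # ignore any query string
        if method == "POST" and path == COMMENT_ENDPOINT:
            posted.add(ip)
        elif path.endswith(ASSET_EXT):
            assets[ip] += 1
    return sorted(ip for ip in posted if assets[ip] == 0)


# The four access log lines from the excerpt above.
LOG_LINES = '''\
109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "GET /blog/2011/06/gnu-parallel-really-helps-with-zcat/ HTTP/1.0" 200 22757 "http://blahblah.net/invest/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "GET /blog/ HTTP/1.0" 301 340 "http://blahblah.net/invest/" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2"
109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "POST /blog/wp-comments-post.php HTTP/1.0" 302 902 "http://drjohnstechtalk.com" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
109.169.61.16 - - [12/Aug/2011:06:31:52 -0400] "GET /blog/ HTTP/1.0" 200 110806 "http://blahblah.net/invest/" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.221.7 Safari/532.2"
'''.splitlines()

print(commenters_without_assets(LOG_LINES))
```

For the excerpt it reports 109.169.61.16, which posted its comment without fetching a single stylesheet, script or image.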

Possible Explanation
My hypothesis is that there are specially constructed advanced searches in Google you can do to find new WordPress blogs. You can download the results and programmatically loop through them and attempt to post your spam and scams. It’s pretty easy to script a client like curl using Perl to post to a WordPress blog. Even I could do it! And that low barrier to entry jibes with the level of professionalism I perceive in these scams, which is to say, pretty low, like something I would cook up by my lonesome! Misspellings, poor English, blatant calls-to-action are par for the course, as well as source IPs from remote regions of the world that have no possible interest in my arcane technical postings.

Now you could argue that a real browser could have cached some of those objects and so upon a return visit it might only access a minimum set of objects and hence look a bit like a program. To that I say that it is rarely the case that all objects get cached. And even if they did, you still have to take time to type in your comment, right? No one can do that in a second. The access lines above span the time from 6:31:51 to 6:31:52!!
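The timing argument can be checked by script rather than by grep and eyeball. Here is a minimal Python sketch that, for each IP, measures the seconds between its first request and its first comment POST. It assumes the combined log format and an English locale for the month abbreviation in the timestamp; the function name is made up.

```python
import re
from datetime import datetime

ENTRY = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) '
)
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 12/Aug/2011:06:31:51 -0400


def seconds_before_comment(lines, comment_path="/blog/wp-comments-post.php"):
    """Map each commenting IP to the seconds between its first request and its first comment POST."""
    first_seen = {}
    dwell = {}
    for line in lines:
        m = ENTRY.match(line)
        if not m:
            continue
        ip = m.group("ip")
        when = datetime.strptime(m.group("time"), TIME_FORMAT)
        first_seen.setdefault(ip, when)
        if m.group("method") == "POST" and m.group("path") == comment_path and ip not in dwell:
            dwell[ip] = (when - first_seen[ip]).total_seconds()
    return dwell


# The first GET and the POST from the access log excerpt above.
VISIT = '''\
109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "GET /blog/2011/06/gnu-parallel-really-helps-with-zcat/ HTTP/1.0" 200 22757 "http://blahblah.net/invest/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
109.169.61.16 - - [12/Aug/2011:06:31:51 -0400] "POST /blog/wp-comments-post.php HTTP/1.0" 302 902 "http://drjohnstechtalk.com" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)"
'''.splitlines()

print(seconds_before_comment(VISIT))  # {'109.169.61.16': 0.0}
```

The access log only has one-second resolution, so 0.0 here means the page fetch and the comment POST landed within the same second.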

The Final solution
I think I’ve made my point about the spam. I have followed Ryan’s advice and activated a plugin called Akismet. Their site looks fairly professional – like they know what they are doing. An API key is required to activate the plugin, but that is available for free for personal blogs. I’ll append to this blog whether or not it works!

Feb 28th update
600 spam comments later, 20 in the last few hours alone, I am sooo tired of rotten apples abusing the leave a comment feature. Even though I am protected from approving the comments, they are still filling up my database. So I have taken an additional step today and implemented a Captcha plugin. This supposedly requires some human intelligence to answer a simple math problem before the post is allowed. I’ll post here about how well it is or isn’t working.

September update
Well, the captcha plugin has stopped virtually all spam, except one random comment. A user wishing to post a comment has to solve a very simple math/language problem. I recommend this approach. I suppose eventually the scammers will catch up with this defense, but in the meantime I am now enjoying peace and tranquility in my seldom-visited but formerly frequently spammed blog!