
The IT Detective Agency: Debugging a Thorny Citrix Connection Issue

This case begins with an observation from the application owner (AO) for Citrix XenApp: external users were being knocked out of their sessions frequently, several times a day, and it happened en masse. Before this problem, users were typically logged in all day. Session data showed that many must have been bumped around 12:30 PM and then again around 2 PM. The problems began July 5th.

The users suffering the disconnects were all external users who access the applications via a Citrix Secure Gateway (CSG). The XenApp servers being accessed are also used on the Intranet, and those internal users were not seeing any drops.

The AO asked if I had changed anything in the network. Nope. Had he changed anything? Nope.

So now we have the classic stand-off, right? AO vs network. There’s a root cause and it’s either the AO or the network guy who’s ultimately at fault.

My attitude in these cases is the following: the network person should prove it’s an application problem, and the application owner should prove it’s a network problem! It sounds cynical, but this approach aligns with the best interests of each party. Both are really working towards the same goal while preserving their own interests. The networking person thinks: if I can prove it’s an application problem, the AO will quit bothering me and I can get back to my real job. After all, I am not knowledgeable about the application. Even if it is a networking issue, I do not know where the issue is, so I need the AO to point out the problem at a detailed level, e.g., the dropped packet or whatever, so I can focus my energies. The reality in my experience is quite different, however: the AO typically does not know enough about networking to make this proof.

Nonetheless I proceeded this way, hoping to prove some kind of application problem so I could get back to my normal activities.

We set up my own PC to use the application. This is always much easier than bothering other people: I can take traces to my heart’s content! So Monday I was connected to XenApp via the CSG. Things were going along fine until 11:35, when I got the disconnected message! I later learned that the bulk of users, who were using a different app, were not disconnected then, but were about an hour later.

Now there are lots of pieces to look at, any one of which could be at fault. Working from a PC on the Internet to the XenApp server we have: the Internet, my Internet router, firewall, load balancer, CSG server, firewall, XenApp server. That’s a lot to look after, but you have to start somewhere. I chose the load balancer. It was rather confusing, even to establish a baseline of “normal” activity. I quickly observed that every 30 seconds packets were being transmitted to the PC even when nothing was going on. Of course the communication was all encrypted, so I did not even attempt to look into the packets. But sometimes I saw seven packets, sometimes six, and more rarely other counts. The packet order didn’t even make sense: sometimes the load balancer responded to the XenApp server before the PC did! The trace of this behaviour up until I was disconnected will be shown here when I get the time to include it.

The end of the trace shows a bunch of FIN packets. FIN is used to terminate a TCP connection. Now we’re getting somewhere. It looks like, from a TCP perspective, a more-or-less orderly shutdown of the connection was occurring. If confirmed, that would point to an application problem and life would be good!
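As an aside, teardown packets like these are easy to pick out of a saved capture with a TCP flags filter. A minimal sketch, assuming the capture was written to a file (the file name is hypothetical; the filter is standard BPF syntax):

# show only packets with the FIN or RST flag set
tcpdump -nr capture.pcap 'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'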

The next day I logged into the CSG and used a XenApp app again. This time I did an additional trace, this one taken on the CSG server itself. Again I was disconnected after a few hours. In this trace the CSG server is called webservera and the XenApp server is xenapp15. This is not a byte-level trace; rather, I ran snoop on Solaris and looked at the metadata.
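The exact snoop invocation isn’t recorded here, but something along these lines produces this kind of layered, per-packet summary (the interface name is an assumption; 1494 is the standard Citrix ICA port):

# -V prints a one-line summary per protocol layer for each packet
snoop -V -d bge0 host xenapp15 and port 1494

Here is the relevant portion of the capture, right around the disconnect: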

________________________________
11:29:23.81577 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:23.81577 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4842, TOS=0x0, TTL=126
11:29:23.81577 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:25.01881 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:25.01881 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4844, TOS=0x0, TTL=126
11:29:25.01881 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:27.42530 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:27.42530 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4861, TOS=0x0, TTL=126
11:29:27.42530 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:30.87645    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87645    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=847, TOS=0x0, TTL=64
11:29:30.87645    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:30.87657    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87657    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=848, TOS=0x0, TTL=64
11:29:30.87657    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:33.02325    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:33.02325    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=849, TOS=0x0, TTL=64
11:29:33.02325    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:53.34945 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:53.34945 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=40, ID=4923, TOS=0x0, TTL=126
11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0

What I saw this time is that an RST packet was being sent from the XenApp server! That’s the very last line, which I will repeat here for emphasis since it is so important to the case:

11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0

TCP RST is a way to immediately tear down a connection! It seemed as though this RST was being converted to a FIN by the CSG. Now it’s looking very much like, for whatever reason, the application decided to terminate the connection. It almost has to be an application problem, right?

Wrong! We have to keep an open mind.

This trace, while dense, hints at where the problem may lie. It is taken on the load balancer with tcpdump -i 0.0. The load balancer has two interfaces, one towards the Internet, the other towards webserverw. The hostname of the load balancer’s Internet-facing interface is CSG, and the hostname of the Citrix client on the Internet is drjohnspc.
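On this load balancer, -i 0.0 captures across all of its VLANs at once – that is why both legs of the proxied session, tagged vlan#4093 and vlan#4094, show up in a single trace. The command would have looked something like this (the host filter is an assumption, added to narrow the capture to the one client):

# capture on all VLANs and restrict to the client under study
tcpdump -ni 0.0 host drjohnspc

In the output below, vlan#4093 carries the leg towards webserverw and vlan#4094 the leg towards the Internet client: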

11:03:59.392810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 35 win 32768 (DF)
11:03:59.455730 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 35:68(33) ack 1 win 48677 (DF)
11:03:59.554810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 68 win 32768 (DF)
11:03:59.585845 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 35 win 64426 (DF)
11:03:59.585855 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768 (DF)
11:03:59.885805 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 68 win 64393 (DF)
11:04:59.465070 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 68:103(35) ack 1 win 48677 (DF)
11:04:59.465080 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:04:59.564818 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 103 win 32768 (DF)
11:05:00.664812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:02.864812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:07.064810 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:15.264811 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:31.464812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:59.807514 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 103:130(27) ack 1 win 48677 (DF)
11:05:59.807741 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: F 130:130(0) ack 1 win 48677 (DF)
11:05:59.807754 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 131 win 32768 (DF)
11:05:59.807759 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: FP 103:130(27) ack 1 win 32768 (DF)
11:06:03.664813 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: . 68:103(35) ack 1 win 32768 (DF)
11:06:12.642847 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: R 1:1(0) ack 131 win 32768 (DF)
11:06:12.642862 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: RP 103:130(27) ack 1 win 32768 (DF)

Notice the timestamps increasing by larger and larger leaps beginning with 11:05:00.664812: 11:05:02, 11:05:07, 11:05:15, 11:05:31 – the interval keeps doubling! This is characteristic of a TCP retransmission. Note that all the other information is the same; it must be retransmitting the same packet. Why? Because it never got there! That seems to be the most likely reason. Now my conviction and hope that an application problem lies at the heart of the issue is starting to crumble. See why you need to keep an open mind? Your opinion can change to the polar opposite conclusion with the input of some additional data like that. Where to turn next?
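As an aside, a quick way to make that doubling pattern explicit is to compute the gap between successive copies of the packet from a saved capture. A rough sketch – the capture file name is hypothetical, and you would substitute real IP addresses for the hostnames in the filter:

# print the seconds elapsed between successive CSG->client packets
tcpdump -nr lb_capture.pcap 'src host CSG and dst host drjohnspc' |
awk '{ split($1, t, ":"); s = t[1]*3600 + t[2]*60 + t[3];
       if (prev) printf "%.1f s since previous packet\n", s - prev; prev = s }'

For the retransmissions above this would print gaps of roughly 1.2, 2.2, 4.2, 8.2 and 16.2 seconds – the classic exponential backoff.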

There is a firewall in between the load balancer and the Internet. Now we will focus our attention on it. It could be that it dropped that packet and all the re-transmits.
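The way to test that is to capture the same conversation on both firewall interfaces at once and compare. A sketch of the idea, run in two separate sessions on the firewall (the interface names eth0 and eth1 are assumptions):

# inside interface, facing the load balancer / CSG
tcpdump -ni eth1 host drjohnspc and port 443
# outside interface, facing the Internet
tcpdump -ni eth0 host drjohnspc and port 443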

Here’s the trace of that same conversation on the firewall’s internal interface, the one facing the CSG (I and O mean inbound and outbound, respectively, with respect to that interface):

11:04:59.441022  I IP CSG.https > drjohnspc.20723: P 2210302227:2210302262(35) ack 1714160833 win 32768
11:05:00.640781  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:02.840742  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:07.040729  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:15.240780  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:31.440571  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:59.783366  I IP CSG.https > drjohnspc.20723: FP 35:62(27) ack 1 win 32768
11:06:03.640595  I IP CSG.https > drjohnspc.20723: . 0:35(35) ack 1 win 32768
^C

and the trace of the same thing on the firewall’s external interface, i.e., facing the Internet and drjohnspc:

11:03:59.269334  O IP CSG.https > drjohnspc.20723: P 2210302159:2210302194(35) ack 1714160833 win 32768
11:03:59.562011  I IP drjohnspc.20723 > CSG.https: . ack 35 win 64426
11:03:59.562139  O IP CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768
11:03:59.861970  I IP drjohnspc.20723 > CSG.https: . ack 68 win 64393

Notice what’s not present in the external interface trace – all those re-transmits, or even the original packet.

Let’s summarize so far. One of those keep-alive packets from the XenApp server reached the firewall, but never exited the firewall. The only possibility is that it got dropped by the firewall!

Now that was a lot of work, but who’s going to do it if not a patient and methodical IT person?

Results
We had a networking problem on our hands after all. Good thing we persisted in this investigation even when it looked like we were off the hook! It was later confirmed that the firewall was “aggressively aging” its connections because it had reached, or was very close to, its connection limit. The firewall’s connection limit was raised and the Citrix disconnection issues went away.
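The exact firewall platform doesn’t matter to the story, but as an illustration: on a Check Point firewall, for instance, the state of the connection table can be checked with something like the following, and a peak value bumping up against the configured connection limit would corroborate the “aggressive aging” explanation:

# summary of the connections state table: current entries (#VALS) and peak (#PEAK)
fw tab -t connections -s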

Let’s go back to that simplistic question that non-experts like to ask: what had changed that caused this problem? The change was external: increased usage of that firewall. Network bandwidth and Internet usage tend to increase over time. There were no changes made by either the networking group or the application group to cause this issue. A seasoned IT detective uses all available clues and arrives at the right conclusion. The “what has changed” question is normally very relevant, but it can’t be the only tool in your toolbox!

Case closed!