Categories
Admin Network Technologies

The IT Detective Agency: the case of the Is it the firewall? or routing? or switch? or layer 2?

Intro
This is yet another tale of how things in the IT world often do not turn out the way they seem at first blush. Or possibly a tale of how, just when you think you’ve seen it all after decades in the industry, something new (to you) occurs.

What’s going on
The firewall team was all busy, so when this strange problem occurred Friday they called in the second string: me. I consider some of the team to be less than customer-focused, so I try to compensate for them and for my lack of knowledge about the firewall by applying a more customer-first attitude. In other words, a sympathetic listening ear. These days it can be hard just to find someone to complain to about your IT problem, and I am keenly aware of that.

There was some strange communication which wasn’t working, mediated by a firewall I had never accessed and was not sure I even had access to. So of course I was asked to join a big conference call where an ongoing debugging session was taking place.

I refused.

I hate being blindsided, and I hate not having answers, which makes me sound even less competent than I already am.

But what I did do was begin my research to see what the system is, if perhaps I had access, etc.

Yes. I found that, through a management system I have access to, I could view the policies on that particular firewall and view the logs as well.

So once I had that up, I agreed to join the call.

They had one server communicating to three different systems. Only one of the three systems was being reached. Yes, the other two were on the same subnet. Two of our firewalls were between the system and the three servers.

And, yes, I could see some drops. The interesting TCP error stated: TCP packet out of state, first packet isn’t SYN.

No problem. Routing must be screwed up such that we have asymmetric routing. It happens all the time. Right? Well, these systems are really appliances with only some basic networking information configurable, no real debugging facility, and really no ability to add a host route.

I could not establish a shell session onto the firewall – not sure what the password naming scheme was that they used.

Then a real firewall guy comes on the call. But his connectivity is messed up, so I keep with the debug session, if for nothing else than to support him, since four eyes are more effective than just two. He shares the routing tables of our two in-line firewalls. It’s hard to understand as these are all new subnets for me, and some are ones that don’t look right. But just focusing on possible host routes for any of these three servers, I don’t see anything amiss.

Firewall policy
And, in firewall policy I see the entire subnet has this traffic permitted. There is no rule specific to one or the other of these systems.

So what do we have up until now?
A purist firewall administrator attitude would be as follows:
The firewall treats all these systems the same, therefore this cannot be a firewall problem. Talk to your networking or system people. Have a nice day.

Well, in fact there was some serious question about the network switch as well. So we had a network guy on the call. So they dug up the MAC addresses of these systems, from which they found the switch ports. Then they checked the port configuration. Ah, some complex 802.1x authentication was configured. As I understand it, this means the device would not even be allowed onto the subnet until it passed some kind of RADIUS authentication. So they removed this 802.1x stuff and just made sure that port was assigned to the right VLAN.

Still, the problem persisted.

I think the other firewall guy was also new to this equipment. Eventually, though, he tries to do a packet trace of the one that’s working versus the one that isn’t.

You know, I never saw the results of those traces, but I’m pretty sure, reading between the lines, that they surprised him, meaning, they did not fit the hypothesis of the asymmetric routing.

In these situations there is the main communication in the main session, then side communications going on, like between me and the firewall guy. But it is all chaotic. Acoustics are mediocre, accents are hard to understand. So the net transfer of information is pretty low. Statements, even important ones, often have to be repeated multiple times (rebroadcasts) to assure everyone “gets it.”

Typical questions were asked. When did this last work? What had changed? There were a couple changes. Some kind of networking thing (I forget what), and then the firewalls changed management systems after that. The firewall change seemed closer in time to the last known success.

You acquire more and more information as you dig into problems. It’s hard to judge which is relevant at the time and which lines of inquiry are a complete waste of time. A good incident manager or project manager can sense which are the more productive lines of investigation and nurture those discussions while suppressing the noise.

Actually it was the networking guy who found the Checkpoint link below. I looked at it. The firewall guy was muttering something about badly behaved, older applications that might exhibit this behaviour.

So we agreed to take the suggested steps, which would basically allow these out of state packets. Drat. The firewall returned an error.

But I continue to refresh the firewall logs. The communication was occurring about every minute. Lo and behold, I see the older drops, and then accepts for the last few minutes! I think it worked. I tell them to check.

They check their end. Sure enough. Communication beginning to work…

The customer tries to assert that this was a firewall problem all along. Not so fast. Firewall guy says, well, the firewall is doing exactly what it’s supposed to be doing. Who’s right?

We’re all good for now, but we state this is a kludge for today and a follow-up meeting needs to occur.

So what happened?
I think the single most important thing is that the firewall guy switched his problem hypothesis from Must be asymmetric routing, to Maybe it’s a badly behaved application. Meaning what? What if you have an application that establishes a TCP connection, and then to beat idle timeouts, sends a KEEP ALIVE packet every minute? Well, now, suppose your firewall is rebooted in the middle of that because it has changed management stations and needs to reload policy? What might the situation look like to it?

If you were unlucky, it just might see these KEEP ALIVE TCP packets without having the connection in its connection table, in other words, exactly the situation we are observing!
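To make that hypothesis concrete, here is a minimal sketch of a stateful firewall that loses its connection table. It is a toy model, not how any particular vendor implements state tracking: the class name, the endpoint names and the single "SYN"/"ACK" flag strings are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStatefulFirewall:
    """Hypothetical model: remembers connections it has seen start with a SYN."""
    table: set = field(default_factory=set)

    def handle(self, conn, flags):
        """Return 'accept' or 'drop' for one TCP segment."""
        if conn in self.table:
            return "accept"            # segment belongs to a known connection
        if flags == "SYN":
            self.table.add(conn)       # only a SYN may open a new connection
            return "accept"
        return "drop"                  # out of state: first packet isn't SYN

    def reboot(self):
        """Simulate a reboot or policy reload that empties the state table."""
        self.table.clear()


fw = ToyStatefulFirewall()
conn = ("appliance", "server-b", 443)      # hypothetical endpoints and port
results = [fw.handle(conn, "SYN"), fw.handle(conn, "ACK")]
fw.reboot()
results.append(fw.handle(conn, "ACK"))     # keep-alive sent after state was lost
results.append(fw.handle(conn, "SYN"))     # application finally reconnects
results.append(fw.handle(conn, "ACK"))
print(results)  # ['accept', 'accept', 'drop', 'accept', 'accept']
```

The third result is the interesting one: the application thinks its connection is alive, the firewall has never heard of it, and the keep-alive is dropped until someone opens a fresh connection.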

What should have happened?
It would have been great if the communication were forced to be re-established from time to time, even once a day. This problem had been going on for days.

But, given this very stupid behaviour on the part of this application, if the app people had been aware they should have forced their application to re-establish the TCP connection after the firewall reboot. Probably, for the one that did work, it had been forced to re-establish.
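What "forced to re-establish" could look like on the application side is sketched below. This is an assumption about how one might do it, not a description of these appliances: the class, its method names, the one-day maximum age and the three-missed-keep-alives threshold are all hypothetical.

```python
class ReconnectPolicy:
    """Hypothetical helper: decides when a long-lived TCP session should be rebuilt."""

    def __init__(self, max_age_s=86400.0, max_missed=3):
        self.max_age_s = max_age_s      # rebuild at least this often (once a day here)
        self.max_missed = max_missed    # or after this many unanswered keep-alives
        self.established_at = None
        self.missed = 0

    def on_connected(self, now):
        """Record a fresh connection and reset the failure counter."""
        self.established_at = now
        self.missed = 0

    def on_keepalive(self, answered):
        """Count consecutive keep-alives that got no reply."""
        self.missed = 0 if answered else self.missed + 1

    def should_reconnect(self, now):
        """True if the session is too old or looks dead."""
        if self.established_at is None:
            return True
        too_old = now - self.established_at >= self.max_age_s
        return too_old or self.missed >= self.max_missed


policy = ReconnectPolicy()
policy.on_connected(now=0.0)
for _ in range(3):                       # three keep-alives vanish into the firewall
    policy.on_keepalive(answered=False)
after_silence = policy.should_reconnect(now=180.0)

policy.on_connected(now=200.0)           # application opens a new connection
after_reconnect = policy.should_reconnect(now=260.0)
print(after_silence, after_reconnect)    # True False
```

With something like this in place, the outage would have lasted a few minutes after the firewall reboot instead of days.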

A firewall person has to be sufficiently aware to realize this could be happening, and advise the app owner on what to do to prevent it.

Conclusion
So whose problem is it?

To the app people it looks like a firewall issue, cut-and-dried. To a firewall guy it looks like an application issue, cut-and-dried. I see both sides. It is some of both. An app owner has to understand enough about firewalls to see that this type of thing can occur. Assigning blame to one side or the other, as most people are wont to do, is not productive. Only a team effort could have revealed this issue. And recall that the “fix” is actually a kludge that lowers security.

Case: almost closed.

References and related
Checkpoint’s note on TCP packet out of state first packet isn’t SYN: https://community.checkpoint.com/t5/General-Topics/TCP-packet-out-of-state-First-packet-isn-t-SYN-tcp-flags-SYN-ACK/td-p/37166

The IT Detective Agency cases are still coming fast and furious. Here’s another recent case: Failed to convert character.

Categories
Admin Network Technologies

Firewall is a significant drag on download speeds

Intro
This post might be a restatement of the obvious to some, but I thought it was noteworthy enough to measure and mention this effect. I was twiddling my thumbs during a long sftp upload when I began to notice these transfers I was doing went really quickly between some servers, and not so much between others. How to control for all variables except the ones I wanted to vary? How to measure things in such a way that an overworked network technician with vested interests in saying the status quo is “good enough” will listen to you? These are things I wrestled with.

The details

To be continued…

Categories
IT Operational Excellence Uncategorized

The IT Detective Agency: Debugging a Thorny Citrix Connection Issue

This case begins with the observation by the application owner for Citrix XenApp. External users were being knocked out of their sessions frequently – several times a day. And it happened en masse. Before this problem users were typically logged in all day. You can see that many must have been bumped around 12:30 PM then again around 2 PM. The problems began July 5th.

The users suffering the disconnects were all external users who access the applications via a Citrix Secure Gateway. The XenApp servers being accessed are also used on the Intranet and those users were not seeing any drops.

The AO asked if I had changed anything in the network. Nope. Had he changed anything? Nope.

So now we have the classic stand-off, right? AO vs network. There’s a root cause and it’s either the AO or the network guy who’s ultimately at fault.

My attitude in these cases is the following: the network person should prove it’s an application problem and the application owner should prove it’s a network problem! It sounds cynical, but this approach aligns with the best interests of each party. Both are really working towards the same goal, but preserving their own interests. E.g., the networking person thinks that If I can prove it’s an application problem then the AO will quit bothering me and I can get back to my real job. After all, I am not knowledgeable about the application. Even if it is a networking issue, I do not know where the issue is so I need the AO to point out the problem at a detailed level, e.g., the dropped packet or whatever, so I can focus my energies. The reality in my experience is quite different however. The AO typically does not know enough about networking to make this proof.

Nonetheless I proceeded this way, hoping to prove some kind of application problem so I could get back to my normal activities.

We enabled my own PC to use the application. This is always much easier than bothering other people. I can take traces to my heart’s content! So Monday I was connected to XenApp via the CSG. I was going along fine until 11:35 when I got the disconnected message! I later learned that the bulk of users, who are using a different app, were not disconnected then, but were at about an hour later.

Now there’s lots of pieces to look at, any one of which could be at fault. Working from PC on Internet to the XenApp we have: The Internet, my Internet router, firewall, load balancer, CSG server, firewall, XenApp server. That’s a lot to look after, but you have to start somewhere. I chose the load balancer. It was rather confusing, even to establish a baseline of “normal” activity. I quickly observed that every 30 seconds packets were being transmitted to the PC even when nothing was going on. Of course the communication was all encrypted so I did not even attempt to look into the packets. But sometimes I saw seven packets, sometimes six, and more rarely different numbers. The packet order didn’t even make sense. Sometimes the load balancer responded to the XenApp before the PC did! The trace of this behaviour until I was disconnected will be shown here when I get the time to include it:

The end of the trace shows a bunch of FIN packets. FIN is used to terminate a TCP connection. Now we’re getting somewhere. It looks like, from a TCP perspective, that a more-or-less orderly shutdown of the connection was occurring. If confirmed that would point to an application problem and life would be good!

The next day I logged into CSG and used a XenApp app again. This time I did an additional trace and included the CSG server itself. Again I was disconnected after a few hours. In this trace the CSG server is called webservera, the XenApp server is xenapp15. This is not a byte-level trace but rather running snoop on Solaris and looking at the meta-data:

________________________________
11:29:23.81577 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:23.81577 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4842, TOS=0x0, TTL=126
11:29:23.81577 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:25.01881 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:25.01881 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4844, TOS=0x0, TTL=126
11:29:25.01881 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:27.42530 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:27.42530 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4861, TOS=0x0, TTL=126
11:29:27.42530 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:30.87645    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87645    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=847, TOS=0x0, TTL=64
11:29:30.87645    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:30.87657    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87657    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=848, TOS=0x0, TTL=64
11:29:30.87657    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:33.02325    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:33.02325    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=849, TOS=0x0, TTL=64
11:29:33.02325    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:53.34945 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:53.34945 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=40, ID=4923, TOS=0x0, TTL=126
11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0

What I saw this time is that RST packet was being sent from the XenApp server! That’s the very last line, which I will repeat here for emphasis since it is so important to the case:

11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0
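Picking that one line out of hours of capture by eye is error-prone. A minimal sketch of doing it mechanically follows. It assumes the summary-line layout seen in the trace above (timestamp, source, arrow, destination, then "TCP" and a mix of KEY=value fields and bare flag words such as Push or Rst); the function name and the three sample lines fed to it are only for illustration.

```python
def parse_snoop_tcp(line):
    """Split one snoop TCP summary line into time, endpoints, flags and fields."""
    head, _, tail = line.partition(" TCP ")
    if not tail:
        return None                      # not a TCP summary line
    time, src, _arrow, dst = head.split()
    tokens = tail.split()
    fields = dict(t.split("=", 1) for t in tokens if "=" in t)
    flags = [t for t in tokens if "=" not in t]   # bare words such as Push, Rst
    return {"time": time, "src": src, "dst": dst, "flags": flags, "fields": fields}


# Three TCP lines copied from the trace above.
lines = [
    "11:29:23.81577 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134",
    "11:29:30.87645    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871",
    "11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0",
]

parsed = [p for p in map(parse_snoop_tcp, lines) if p]
resets = [(p["time"], p["src"]) for p in parsed if "Rst" in p["flags"]]
print(resets)  # [('11:29:53.34945', 'xenapp15')]
```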

TCP RST is a way to immediately disconnect a connection! It seemed as though this was being converted to a FIN by the CSG. Now it’s looking very much like for whatever reason the application decided to terminate the connection. It almost has to be an application problem, right?

Wrong! We have to keep an open mind.

This trace, while dense, hints at where the problem may lie. It is taken on the load balancer with tcpdump -i 0.0. The load balancer has two interfaces, one towards the Internet, the other towards webserverw. The load balancer’s Internet interface has the hostname CSG, and the Citrix client on the Internet has the hostname drjohnspc.

11:03:59.392810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 35 win 32768 (DF)
11:03:59.455730 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 35:68(33) ack 1 win 48677 (DF)
11:03:59.554810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 68 win 32768 (DF)
11:03:59.585845 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 35 win 64426 (DF)
11:03:59.585855 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768 (DF)
11:03:59.885805 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 68 win 64393 (DF)
11:04:59.465070 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 68:103(35) ack 1 win 48677 (DF)
11:04:59.465080 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:04:59.564818 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 103 win 32768 (DF)
11:05:00.664812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:02.864812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:07.064810 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:15.264811 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:31.464812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:59.807514 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 103:130(27) ack 1 win 48677 (DF)
11:05:59.807741 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: F 130:130(0) ack 1 win 48677 (DF)
11:05:59.807754 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 131 win 32768 (DF)
11:05:59.807759 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: FP 103:130(27) ack 1 win 32768 (DF)
11:06:03.664813 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: . 68:103(35) ack 1 win 32768 (DF)
11:06:12.642847 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: R 1:1(0) ack 131 win 32768 (DF)
11:06:12.642862 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: RP 103:130(27) ack 1 win 32768 (DF)

Notice the time stamp increasing by larger and larger leaps beginning with 11:05:00.664812: 11:05:02, 11:05:07, 11:05:15, 11:05:31. The gap between sends keeps roughly doubling! This is characteristic of a TCP retransmit. Note that all the other information is the same. It must be retransmitting the same packet. Why? Because it never got there! That seems to be the most likely reason. Now my conviction and hope that an application problem lies at the heart of the issue is starting to crumble. See why you need to keep an open mind? Your opinion can change to the polar opposite conclusion with the input of some additional data like that. Where to turn next?

There is a firewall in between the load balancer and the Internet. Now we will focus our attention on it. Could be that it dropped that packet and all the re-transmits.
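The doubling is easy to check with a few lines of arithmetic. The sketch below takes the timestamps of the repeated CSG to drjohnspc segment 68:103 on vlan 4094 from the load balancer trace above (the original send plus the five repeats) and prints the gaps between them. Only the variable names are mine.

```python
from datetime import datetime

# Original send at 11:04:59.465080 followed by the five repeats of segment 68:103.
stamps = [
    "11:04:59.465080",
    "11:05:00.664812",
    "11:05:02.864812",
    "11:05:07.064810",
    "11:05:15.264811",
    "11:05:31.464812",
]

times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
# Seconds between each send and the next, rounded to a tenth.
gaps = [round((b - a).total_seconds(), 1) for a, b in zip(times, times[1:])]
print(gaps)  # [1.2, 2.2, 4.2, 8.2, 16.2]
```

Each wait is roughly twice the previous one, which is the exponential backoff a TCP sender applies when a segment goes unacknowledged.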

Here’s the trace of that same conversation on the firewall’s internal interface (which faces the CSG) (I(O) means inbound(outbound) with respect to that interface):

11:04:59.441022  I IP CSG.https > drjohnspc.20723: P 2210302227:2210302262(35) ack 1714160833 win 32768
11:05:00.640781  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:02.840742  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:07.040729  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:15.240780  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:31.440571  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:59.783366  I IP CSG.https > drjohnspc.20723: FP 35:62(27) ack 1 win 32768
11:06:03.640595  I IP CSG.https > drjohnspc.20723: . 0:35(35) ack 1 win 32768
^C

and the trace of the same thing on the firewall’s external interface, i.e., facing the Internet and drjohnspc:

11:03:59.269334  O IP CSG.https > drjohnspc.20723: P 2210302159:2210302194(35) ack 1714160833 win 32768
11:03:59.562011  I IP drjohnspc.20723 > CSG.https: . ack 35 win 64426
11:03:59.562139  O IP CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768
11:03:59.861970  I IP drjohnspc.20723 > CSG.https: . ack 68 win 64393

Notice what’s not present in the external interface trace – all those re-transmits, or even the original packet.

Let’s summarize so far. One of those keep-alive packets from the XenApp server reached the firewall, but didn’t exit the firewall. The only possibility is that it got dropped by the firewall!
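The two firewall traces can also be lined up by sequence number rather than by eye. The sketch below assumes, as tcpdump normally does, that the first segment printed in each capture shows absolute sequence numbers and the later ones are relative to it. The (start, end) pairs are copied from the CSG to drjohnspc lines of the two listings above; the function and variable names are mine, and the external listing may of course be only an excerpt.

```python
def absolute_ranges(segments):
    """Convert (start, end) pairs to a set of absolute sequence ranges.

    Assumes the first pair is absolute and the rest are relative to its start.
    """
    base = segments[0][0]
    return {segments[0]} | {(base + start, base + end) for start, end in segments[1:]}


# CSG -> drjohnspc segments on the firewall's internal interface (arriving).
internal = [(2210302227, 2210302262), (0, 35), (0, 35), (0, 35),
            (0, 35), (0, 35), (35, 62), (0, 35)]
# CSG -> drjohnspc segments on the firewall's external interface (leaving).
external = [(2210302159, 2210302194), (35, 68)]

seen_inside = absolute_ranges(internal)
seen_outside = absolute_ranges(external)

never_left = sorted(seen_inside - seen_outside)
last_byte_forwarded = max(end for _, end in seen_outside)
print(never_left)           # [(2210302227, 2210302262), (2210302262, 2210302289)]
print(last_byte_forwarded)  # 2210302227
```

The last byte the firewall forwarded is exactly where the first missing segment begins, so the stream stops dead at the firewall with the 35-byte packet sent at 11:04:59.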

Now that was a lot of work, but who’s going to do it if not a patient and methodical IT person?

Results
We got a networking problem on our hands after all. Good thing we persisted in this investigation even when it looked like we were off the hook! Later it was confirmed that the firewall was “aggressively aging” its connections because it had either reached or was very close to its connection limit. The firewall connection limit was raised and the Citrix connection issues went away.
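Why would a nearly full connection table hurt a session that sends a keep-alive every minute? Here is a toy model of the idea. Every number in it is made up for illustration (the 80% threshold, the 3600-second and 30-second timeouts, the table sizes), and real firewalls decide what to age out in their own ways. The only thing taken from the case is the roughly 60-second spacing of the keep-alives in the load balancer trace.

```python
def idle_timeout(active, limit, normal_s=3600, aggressive_s=30, threshold=0.8):
    """Idle timeout a toy firewall applies at a given table fill level.

    All defaults are hypothetical values chosen for illustration.
    """
    return aggressive_s if active >= threshold * limit else normal_s


def survives(keepalive_interval_s, active, limit):
    """True if a connection that only sends keep-alives outlives the idle timer."""
    return keepalive_interval_s < idle_timeout(active, limit)


quiet_day = survives(60, active=10_000, limit=25_000)    # table far from full
busy_day = survives(60, active=24_000, limit=25_000)     # table nearly full
after_fix = survives(60, active=24_000, limit=50_000)    # same load, higher limit
print(quiet_day, busy_day, after_fix)  # True False True
```

Same application, same keep-alive interval, same firewall policy. The only thing that changes between the three cases is how busy the firewall is relative to its limit.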

Let’s go back to that simplistic question that non-experts like to ask: what had changed that caused this problem? The change was external events – increased usage of that firewall. Network bandwidth, Internet usage – they all tend to increase over time. There were no changes done by either networking or the application group to cause this issue. A seasoned IT detective uses all available clues and arrives at the right conclusion. The “what has changed” question is normally very relevant, but it can’t be the only tool in your toolbox!

Case closed!