Intro
Yesterday the company I’ve been consulting for had a partial outage of their multi-gigabit fiber connection from TWC Business Class in North Carolina. We’ve never seen an outage with these characteristics.
The details
The outage went mostly unnoticed, but various SiteScope monitors that fetch web pages were periodically going off and then recovering. So it wasn’t a hard outage. How do you pin something like that down?
It’s easiest to test against a web site whose IP address isn’t constantly changing, which pretty much rules out the majors like google.com or microsoft.com. I actually used my own site – it runs on good infrastructure in Amazon’s data center and its IP is fixed. And yes, I was occasionally seeing timeouts. Could it simply be waiting on a DNS lookup? That would explain everything. So I ran curl verbosely:
$ curl -vv www.drjohnstechtalk.com
* About to connect() to www.drjohnstechtalk.com port 80 (#0)
*   Trying 50.17.188.196...
That response came quickly, then it froze. So I knew it wasn’t a DNS resolution problem. This could have also been shown by doing a trace, but the curl method was a faster way to get results.
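Incidentally, if you want to separate the two delays explicitly, a few lines of Python will do it. This is just a sketch of the idea, not something from the actual troubleshooting session – the hostname is the one above, and the one-second timeout is my own choice to match the curl tests that follow:

# Time the DNS lookup and the TCP connect separately (illustrative sketch)
import socket, time

host = "www.drjohnstechtalk.com"

t0 = time.time()
addr = socket.gethostbyname(host)   # DNS lookup only
t1 = time.time()

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(1.0)                   # cap the handshake wait at one second
try:
    s.connect((addr, 80))           # TCP three-way handshake only
    result = "connected"
except socket.timeout:
    result = "connect timed out"
finally:
    s.close()
t2 = time.time()

print("DNS %.0f ms, connect %.0f ms: %s" % ((t1 - t0)*1000, (t2 - t1)*1000, result))

When DNS is the culprit the first number is the big one; in our case it was always the connect that hung.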
I decided to put a max_time limit on curl of one second just to get a feel for the problem, running this command frequently:
$ curl -m1 www.drjohnstechtalk.com
After all, when the web site and the gigabit connection are both working, the answer comes back in about 60 msec, so one second should be more than enough time.
So while there was some of this:
$ curl -m1 www.drjohnstechtalk.com
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://drjohnstechtalk.com/blog/">here</a>.</p>
<hr>
<address>Apache/2 Server at www.drjohnstechtalk.com Port 80</address>
</body></html>
there was also some of this:
$ curl -m1 www.drjohnstechtalk.com
curl: (28) connect() timed out!
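Rather than re-running that by hand, you could script the sampling. Here’s a rough Python sketch of the same idea (an illustration, not what I actually ran): it shells out to curl with the same one-second limit and tallies the timeouts. The 60 attempts and five-second spacing are arbitrary choices:

# Automate the repeated curl -m1 checks and count the timeouts (sketch)
import subprocess, time

timeouts = 0
attempts = 60
for i in range(attempts):
    rc = subprocess.call(["curl", "-m1", "-s", "-o", "/dev/null",
                          "www.drjohnstechtalk.com"])
    if rc == 28:        # curl exit code 28 means the operation timed out
        timeouts += 1
        print("attempt %d timed out" % i)
    time.sleep(5)

print("%d timeouts out of %d attempts" % (timeouts, attempts))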
We determined that our perimeter firewall was passing all the packets.
Again, the advantage of a fixed, rarely used IP address is that you can throw it into a trace statement and not get overwhelmed with noise. So I could see from a trace taken during one of those timeouts that we weren’t getting a response to our SYN packet.
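If you can’t trace on the firewall itself, scapy – which I get to next – can do the same watching from any Linux host. A minimal sniffing sketch, assuming the fixed IP from the curl output above; run it as root during one of the timeouts and a SYN with no answering SYN-ACK stands out immediately:

# Watch all port-80 traffic to/from the site's fixed IP (run as root)
from scapy.all import sniff

sniff(filter="host 50.17.188.196 and tcp port 80",
      prn=lambda pkt: pkt.summary(),
      timeout=60)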
So I turned to a SYN packet generator to reproduce the problem and learn more: scapy. I’ve just improved my write-up of scapy to reflect some of the new ways I used it yesterday!
To begin with, I didn’t know much about scapy except what I had previously posted. But it worked every time! I could not reproduce the problem no matter how hard I tried:
>>> sr(IP(dst="drjohnstechtalk.com")/TCP(dport=80))
Begin emission:
........................................................................................Finished to send 1 packets.
.............................................*
Received 134 packets, got 1 answers, remaining 0 packets
(<Results: TCP:1 UDP:0 ICMP:0 Other:0>, <Unanswered: TCP:0 UDP:0 ICMP:0 Other:0>)
I racked my brains: what could be different about these scapy packets? Not being a TCP expert, I knew the answer could be many, many things. But did I give up? No! I quickly scanned the scapy for dummies tutorial and realized a few things. I had assumed scapy was randomizing its source port the way all other TCP applications do, but it wasn’t! You need a sport=RandShort() argument in the TCP section to do that. Who knew? So I had been sending every packet from the same source port, specifically 20. When I switched to a randomized source port I quickly reproduced the timeout issue! And most amazingly, when I encountered a source port that didn’t work, it consistently didn’t work – every single time. Its neighboring ports were fine. Some of its neighbors’ neighbors didn’t work, also consistently.
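In scapy syntax the randomized probe looks something like this (a reconstruction – I didn’t save the exact command):

>>> sr(IP(dst="drjohnstechtalk.com")/TCP(dport=80,sport=RandShort()))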
So, for instance:
>>> sr(IP(dst="drjohnstechtalk.com")/TCP(dport=80,sport=21964))
Begin emission:
........................................................................................Finished to send 1 packets.
........................................
was consistently not working. Same for source port 21962. Source port 21963 was consistently fine.
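To map the bad ports systematically you could sweep a whole range of source ports and record which ones never get a SYN-ACK back. Here’s a sketch along those lines using scapy’s sr1() with a timeout; the port range brackets the failing ports above and is otherwise arbitrary:

# Probe a range of source ports and flag the black-holed ones (sketch)
from scapy.all import IP, TCP, sr1

for sport in range(21960, 21970):
    ans = sr1(IP(dst="drjohnstechtalk.com") /
              TCP(sport=sport, dport=80, flags="S"),
              timeout=2, verbose=0)
    # 0x12 = SYN+ACK flag bits set in the reply
    if ans is not None and ans.haslayer(TCP) and int(ans[TCP].flags) & 0x12 == 0x12:
        print("sport %d: SYN-ACK received" % sport)
    else:
        print("sport %d: no response" % sport)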
Well, this explains the intermittent SiteScope errors.
Gotta be the firewall
I know what you’re thinking. Routers don’t care about TCP port information. That’s much more like a firewall connection-table thing. And I agree, but our firewall trace showed these SYN packets getting through, and no SYN-ACK coming back.
It’s way too difficult to do a trace on a Cisco router, so I looked at the router config instead and didn’t see anything amiss.
So I called the ISP, TWC Business Class. I got a pre-recorded message about outages in North Carolina, which is exactly where this link is located! The coincidence seems too great. I still don’t have clarity from them – I guess customer service is not their strong suit. They haven’t even bothered to get back to me 24 hours later (and this is for a major fiber circuit).
References and related
The amazingly customizable packet generator known as scapy.
Probably the best write-up of scapy is this scapy for dummies PDF.