Categories
DNS IT Operational Excellence Network Technologies Uncategorized

Google’s DNS Servers Rock!

Intro
DNS is the Domain Name System, the Internet service that converts mnemonic names like www.mysite.com into IP addresses, e.g., 200.54.129.57.

I tried to run a cache-only DNS server for use by a proxy server. What I found is that certain sites were not accessible on a frequent basis. I think uol.com.br is one of the problem sites (need to check this). It may not mean much to a US audience, but it’s really popular in Brazil!

At some point I happened to learn that Google has a public DNS service. This is worth pondering. No one of any repute had offered a public DNS service up to that point. There are a host of concerns about security, especially DNS cache poisoning. They blazed a trail, and did it in a way only Google and very few other major infrastructure players could. Not only did they offer a DNS service, they put their DNS servers all over the Internet and created convenient anycast addresses for their servers.

I am no expert on anycast addresses. You can look it up on Wikipedia, however. The essence for my purposes is that with a single IP address you’re going to hit the closest server, network-wise. So no matter where you are some Google DNS server is not far away. Try it. The anycast addresses are 8.8.8.8 and 8.8.4.4. They don’t mind, really! You can ping them. Traceroute to them, whatever. From the Amazon cloud Northeast 8.8.8.8 responds to PINGs in 3.4 ms. That’s really low. Not so low as to make me think they are in the same data center (it is different companies after all), but not far away.

The gold standard for running a DNS service is BIND. I have been running it for many years now and I want to give the Internet Systems Consortium their due for providing this wonderful application. Once I got wind of my DNS difficulties as mentioned above, I had to wonder why everyone else wasn’t complaining. They had to be using something else. I ran a flat-out performance test: 5000 queries from an actual proxy log, fed straight to my BIND DNS server, and then to Google’s DNS server 8.8.8.8. I have to dig up the numbers, but Google’s won by quite a bit! This result was actually surprising because you’re always going off-site to the Google DNS server, whereas my server can build up its cache and is right on my network. From where I tested, the Google server was about 11 ms away, so 5000 queries x 11 ms = 55 s. That is a 55 s handicap from network considerations alone! Yet it is faster. On the quickest of queries the local server is indeed faster, but over the course of real-life queries you always get a few problematic ones which either time out or just take a long time to return a response. That’s what kills the traditional DNS server and where Google has (obviously) made some optimizations.
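A comparison like this can be sketched in a few lines of shell. This is only an illustration, not the script used for the test: the function names, the file queries.txt (one hostname per line, pulled from a proxy log) and the local resolver address 127.0.0.1 are all assumptions.

```shell
# Sketch only: time_resolver, handicap_seconds and queries.txt are
# illustrative names, not part of the original test.

# Resolve every name in a file against one resolver, one query at a time,
# and print the elapsed wall-clock time in whole seconds.
# Usage: time_resolver 8.8.8.8 queries.txt
time_resolver() {
    start=$(date +%s)
    while read -r name; do
        dig +short "$name" "@$1" > /dev/null
    done < "$2"
    end=$(date +%s)
    echo "$1: $(( end - start )) s"
}

# Network-only handicap in whole seconds:
# number of queries x round-trip time in ms / 1000.
handicap_seconds() {
    echo $(( $1 * $2 / 1000 ))
}

# The 55 s figure from the text: 5000 queries at 11 ms each.
handicap_seconds 5000 11
```

To compare the two resolvers you would run time_resolver 127.0.0.1 queries.txt and then time_resolver 8.8.8.8 queries.txt, ideally more than once, since the first run warms the local cache.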

And, that’s not all! Google also deals in a more forgiving fashion with broken domain names. I used to get on my high horse and proclaim to others about how broken their DNS servers are – it’s no wonder I can’t resolve their names, which means, by the way, I also cannot get to their web site nor send them email!

It’s effectively like taking yourself off the Internet, or so I thought. Turns out in some cases that’s only true if you’ve constrained yourself to resolving names with BIND. You see, BIND enforces the rules. And I’m a believer in rules. The Internet has about 5,000 technical rules called RFCs. DNS is a topic of many of these rules. The Internet could only have expanded to the size it currently has because all the major players agreed to abide by those rules. What Google has done with their server, in effect, is to say, “Well, if you don’t follow the rules, we’re going to try to work with you anyways.”

Here’s a concrete example: appliedcoatings.ca. I guess at some point they’ll actually fix their severely broken DNS, but at the time I write this, August 21, 2011, these comments are valid and their domain is severely broken. In fact, I was amazed that people weren’t jumping up and down screaming at them. I couldn’t even send an email to them. That’s akin to knocking yourself off the Internet, right? Ah, but it all depends on whose DNS servers you are using!

There used to be lots of good free DNS analyzers, like dnsreport.com. You can still find a few around. www.zonecheck.fr, for instance. It shows FAILURE. If it were better written it would show the real problem, which is a lame delegation. But we’re experts, and we don’t need such tools! We will do the queries ourselves and show the lame delegation. We start by learning who are the authoritative nameservers for .ca, the top-level domain used in Canada:

 dig ns ca

; <<>> DiG 9.7.1-P2 <<>> ns ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52928
;; flags: qr rd ra; QUERY: 1, ANSWER: 10, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;ca.                            IN      NS

;; ANSWER SECTION:
ca.                     83585   IN      NS      a.ca-servers.ca.
ca.                     83585   IN      NS      c.ca-servers.ca.
ca.                     83585   IN      NS      e.ca-servers.ca.
ca.                     83585   IN      NS      f.ca-servers.ca.
ca.                     83585   IN      NS      j.ca-servers.ca.
ca.                     83585   IN      NS      k.ca-servers.ca.
ca.                     83585   IN      NS      l.ca-servers.ca.
ca.                     83585   IN      NS      m.ca-servers.ca.
ca.                     83585   IN      NS      z.ca-servers.ca.
ca.                     83585   IN      NS      sns-pb.isc.org.

;; ADDITIONAL SECTION:
a.ca-servers.ca.        83594   IN      A       192.228.27.11

Now we ask one of them about the nameservers for appliedcoatings.ca:

 dig ns appliedcoatings.ca @a.ca-servers.ca.

; <<>> DiG 9.7.1-P2 <<>> ns appliedcoatings.ca @a.ca-servers.ca.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 288
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      NS

;; AUTHORITY SECTION:
appliedcoatings.ca.     86400   IN      NS      sp2.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      sp1.domainpeople.com.

So far everything's cool. Now, since the authoritative flag (AA) was not present in that response we re-ask that query, but now to one of the nameservers that's supposed to be authoritative for that domain:

dig ns appliedcoatings.ca @sp2.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> ns appliedcoatings.ca @sp2.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24373
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      NS

;; ANSWER SECTION:
appliedcoatings.ca.     86400   IN      NS      ns1.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      ns2.domainpeople.com.

Uh-oh. That's not supposed to happen. We're getting back an entirely different set of nameservers. That's a lame delegation. The domain should be considered completely broken. I think even BIND might be forgiving up to this point. A BIND resolver does these types of queries to get at the answer. At this point it says, "OK, this is strange, but not necessarily fatal. I will send my subsequent queries to ns1.domainpeople.com and ns2.domainpeople.com since they are listed as being the nameservers of record."
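The check we just did by hand, comparing the nameservers the parent zone delegates to with the ones the zone itself publishes, can be sketched as a small script. The function names are made up for illustration, and the parsing assumes dig's usual five-column record layout as seen in the transcripts above.

```shell
# Sketch only: ns_set and same_ns_set are illustrative names.

# Print the sorted NS targets for a zone as reported by one server.
# +norecurse asks only that server. The parent returns its delegation in
# the AUTHORITY section, so the full output is parsed rather than +short.
# Usage: ns_set appliedcoatings.ca a.ca-servers.ca.
ns_set() {
    dig +norecurse ns "$1" "@$2" | awk '$4 == "NS" { print tolower($5) }' | sort -u
}

# Succeed only if two newline-separated NS lists contain the same names.
same_ns_set() {
    a=$(printf '%s\n' "$1" | sort -u)
    b=$(printf '%s\n' "$2" | sort -u)
    [ "$a" = "$b" ]
}

# The two sets from the transcripts above, pasted in by hand.
parent_ns='sp1.domainpeople.com.
sp2.domainpeople.com.'
child_ns='ns1.domainpeople.com.
ns2.domainpeople.com.'

if same_ns_set "$parent_ns" "$child_ns"; then
    echo "delegation consistent"
else
    echo "delegation mismatch"
fi
```

Against a live domain you would feed it the output of ns_set run once against a parent server and once against a server the parent names.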

So now let's get to something useful: looking up the mail exchanger record so we see how to deliver mail to this domain. BIND, which has been fastidiously following the rules, does it as follows:

dig mx appliedcoatings.ca @ns1.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @ns1.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 49996
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; Query time: 79 msec
;; SERVER: 204.174.223.72#53(204.174.223.72)
;; WHEN: Sun Aug 21 19:05:43 2011
;; MSG SIZE  rcvd: 36

That's not good. Status is REFUSED. But BIND can even forgive this slight. There is one more nameserver to try after all, right? Last chance query:

dig mx appliedcoatings.ca @ns2.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @ns2.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 44404
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; Query time: 72 msec
;; SERVER: 64.40.96.140#53(64.40.96.140)
;; WHEN: Sun Aug 21 19:07:34 2011
;; MSG SIZE  rcvd: 36

Status also REFUSED. Now we are really and truly dead. If you are using a BIND nameserver you have no way to send email to someone@appliedcoatings.ca. But not so with Google!

Of course I don't know how Google wrote their DNS server, but I suspect their infrastructure experts wrote it themselves rather than using an open source program. So with a Google nameserver you will get a response:

dig mx appliedcoatings.ca @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6901
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; ANSWER SECTION:
appliedcoatings.ca.     82805   IN      MX      10 mail.appliedcoatings.ca.

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Sun Aug 21 19:11:14 2011
;; MSG SIZE  rcvd: 57

and just to close the loop and make sure this is a valid host you would do this:

dig mail.appliedcoatings.ca @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> mail.appliedcoatings.ca @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35190
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;mail.appliedcoatings.ca.       IN      A

;; ANSWER SECTION:
mail.appliedcoatings.ca. 86400  IN      A       66.183.21.181

And we can go the next step and begin an SMTP conversation with that server to make sure it is really operating. After all, if they messed up DNS there's no telling what else they might have gotten wrong.

 telnet  66.183.21.181 25
Trying 66.183.21.181...
Connected to 66.183.21.181.
Escape character is '^]'.
220 mail.appliedcoatings.ca Microsoft ESMTP MAIL Service, Version: 6.0.3790.4675 ready at  Sun, 21 Aug 2011 16:22:04 -0700
HELO localhost
250 mail.appliedcoatings.ca Hello [50.17.188.196]
quit
221 2.0.0 mail.appliedcoatings.ca Service closing transmission channel
Connection closed by foreign host.

Yup. They've got an operating mail server at that IP.

So we can reverse engineer a bit what Google's DNS server must have done behind the scenes to arrive at a valid answer where BIND could not. I'm 100% sure that Google would have also done the query

dig mx appliedcoatings.ca @ns1.domainpeople.com

since that is the right thing to do. But having gotten no satisfactory answer there (status: REFUSED), and having been refused a second time by ns2.domainpeople.com, it must then go back to the originally named nameservers, sp1 and sp2. Watch what happens in that case:

 dig mx appliedcoatings.ca @sp1.domainpeople.com.

; <<>> DiG 9.7.1-P2 <<>> mx appliedcoatings.ca @sp1.domainpeople.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10226
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;appliedcoatings.ca.            IN      MX

;; ANSWER SECTION:
appliedcoatings.ca.     86400   IN      MX      10 mail.appliedcoatings.ca.

;; AUTHORITY SECTION:
appliedcoatings.ca.     86400   IN      NS      ns1.domainpeople.com.
appliedcoatings.ca.     86400   IN      NS      ns2.domainpeople.com.

;; ADDITIONAL SECTION:
mail.appliedcoatings.ca. 86400  IN      A       66.183.21.181

The AA (authoritative) flag is set in the response. So it's a good response, but obtained from the "wrong" nameserver. Nevertheless, it is a response, and it gives anyone using Google's resolver more functionality than someone using BIND.
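The fallback order guessed at above can be written down as a short loop: try the nameservers the zone names for itself, then fall back to the ones the parent delegated to. To be clear, this is a sketch of my guess, not Google's implementation. The function name first_answer and the QUERY_CMD override are assumptions added so the logic can be exercised without touching the network.

```shell
# Sketch only: first_answer and QUERY_CMD are illustrative names.
# Override QUERY_CMD with a stub to try the logic without network access.
QUERY_CMD=${QUERY_CMD:-dig}

# Try each nameserver in turn and print the first non-empty answer.
# A REFUSED response yields no output from dig +short, so the loop moves on.
# Usage: first_answer mx appliedcoatings.ca ns1... ns2... sp1... sp2...
first_answer() {
    rtype=$1
    name=$2
    shift 2
    for server in "$@"; do
        answer=$($QUERY_CMD +short +norecurse "$rtype" "$name" "@$server")
        if [ -n "$answer" ]; then
            echo "$server: $answer"
            return 0
        fi
    done
    return 1
}
```

For the broken domain above the call would be first_answer mx appliedcoatings.ca ns1.domainpeople.com. ns2.domainpeople.com. sp1.domainpeople.com. sp2.domainpeople.com., with the "nameservers of record" listed first and the parent's delegation last.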

Conclusion
So far we've got three advantages speaking favorably for Google's DNS server: it's faster, its answers are more complete and it's universally available. Wait, there's more! Another nice thing is what it does not do. Some ISPs have a "feature" I call DNS clobbering. In fact it's so annoying I will devote a whole blog post to describing it in more detail. Essentially they take license with DNS and make up answers to some queries! It's true and it's truly annoying. Not all ISPs do this but mine certainly does. So the other nice thing about Google DNS is that it does not do DNS clobbering, and it's available for you to use at home to avoid this annoying feature. You just set your DNS servers manually rather than have them assigned automatically via DHCP.

Other Resources
I should mention that while researching public DNS servers I was also led to commercial versions of the same thing. I went so far as to test the timings on one of those services and found that it is more distant, round-trip-wise, than Google's anycast server. Stands to reason. Google's got the best Internet access of anyone. They're on all the major highways. The commercial offerings have some additional cool features, however. They can serve as a URL filter. So if someone puts in a URL which leads to a malicious site, for example, they can respond with an answer that spares you from going to that infected site. This is a little more crude than URL filtering at the proxy level, since a DNS server has no knowledge of the URI whereas a proxy URL filter does, but it could be quite serviceable. I'm not sure it allows you to pick and choose URL categories to block as with a URL filter (gambling, porn, hacking sites, etc.).

A lot more information on using Google DNS is at http://code.google.com/speed/public-dns/docs/using.html.

September 1 Update - a Crack in the Infrastructure
I now have my first case of a domain name which Google DNS did not resolve correctly, and for no apparent reason. The domain name is forums.tweaktown.com. Here's proof of Google's failure, followed immediately by Amazon's DNS servers' success:

dig forums.tweaktown.com @8.8.8.8

; <<>> DiG 9.7.1-P2 <<>> forums.tweaktown.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 15826
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;forums.tweaktown.com.          IN      A

;; AUTHORITY SECTION:
tweaktown.com.          116     IN      SOA     ns21.domaincontrol.com. dns.jomax.net. 2011060602 28800 7200 604800 86400

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Thu Sep  1 14:40:50 2011
;; MSG SIZE  rcvd: 106


 dig forums.tweaktown.com

; <<>> DiG 9.7.1-P2 <<>> forums.tweaktown.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52290
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; QUESTION SECTION:
;forums.tweaktown.com.          IN      A

;; ANSWER SECTION:
forums.tweaktown.com.   1885    IN      A       38.101.21.25

;; AUTHORITY SECTION:
tweaktown.com.          1943    IN      NS      ns22.domaincontrol.com.
tweaktown.com.          1943    IN      NS      ns21.domaincontrol.com.

;; ADDITIONAL SECTION:
ns21.domaincontrol.com. 753     IN      A       216.69.185.11

;; Query time: 0 msec
;; SERVER: 172.16.0.23#53(172.16.0.23)
;; WHEN: Thu Sep  1 14:40:55 2011
;; MSG SIZE  rcvd: 122

All BIND servers I tried during this time returned the correct answer.

Is this an isolated incident or the tip of an iceberg of problems? I hope it is a one-off. I'll post updates as I find out more. I am slightly concerned now.

References and related
I finally wrote my own web interface to DNS and published the code I did it with. Check it out here.

A web interface to Google's public DNS service, which will give you more debug information, is https://dns.google.com/

Categories
Admin Internet Mail IT Operational Excellence

The IT Detective Agency: The Case of Slow Sendmail Performance Finally Cracked

I’ve been running sendmail for years and years. It’s a very solid MTA, though perhaps not fashionable these days. At one point I even made the leap from running on Sun/Solaris to SLES. I’ve always had a particular problem on a couple of these servers: they do not react gracefully to mail storms. An application running on another server sends out a daily mail blast to 2000 users, all at once. Hey, I’m not running Gmail here, but normal volume is several messages per second nonetheless, and that is handled fairly well.

But this mail blast actually knocks the system offline for a few minutes. The load average rockets up to 160. It’s essentially a self-inflicted denial-of-service attack. In my gut I always felt the situation could be improved, but was too busy to look into it.

When it was time to buy a replacement server, I had to consider and justify what to get. A “screaming server” is a little hard for a hardware vendor to turn into an order! So where are the bottlenecks? I decided to capture the output of uptime, which provides load averages, and iostat, an optional package which analyzes I/O usage, at five-second intervals throughout the day. Here’s the iostat job:

nohup iostat -t -c -m -x 5 > /tmp/iostat &

and the uptime was a tiny script I called cpu-loop.sh:

#!/bin/sh
# Log a timestamp and the load averages every five seconds.
while /bin/true; do
    sleep 5
    date
    uptime
done

called from the command line as:

nohup ~/cpu-loop.sh > /tmp/cpu &

Strange thing is that though load average shoots through the roof, CPU usage isn’t all that high.

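If I have this right, load average on Linux counts the processes that are either runnable or blocked in uninterruptible sleep, which usually means waiting on disk I/O. Sendmail forks a process for each incoming email, so the number of sendmail processes climbs dramatically during a mail storm, and processes stuck waiting on the disk push the load average up even while the CPU sits idle.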
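To find the worst moment in a day's worth of cpu-loop.sh output, something like the following does the job. It is a sketch that assumes the Linux uptime format, where each line ends in "load average: x, y, z". The function name peak_load is made up.

```shell
# Sketch only: peak_load is an illustrative name.

# Print the highest 1-minute load average found in a cpu-loop.sh log.
# Lines without "load average: " (the date lines) are ignored.
# Usage: peak_load /tmp/cpu
peak_load() {
    awk -F'load average: ' '
        /load average: / {
            split($2, avg, ", ")
            if (avg[1] + 0 > max) max = avg[1] + 0
        }
        END { print max + 0 }
    ' "$1"
}
```

Pairing the peak with the date line printed just before it in the log tells you when the storm hit.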

The fundamental question is whether we are thirsting for more CPU or more I/O. Then there are the peripheral concerns like the speed of the PCI bus, the size of the level two cache and the number of CPUs. The standard profiling tools don’t quite give you enough information.

Here’s actual output of three consecutive iostat executions:

Time: 05:11:56 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.92    0.00    5.36   21.74    0.00   66.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.00    0.00    3.00     0.00     0.05    37.33     0.03    8.53   5.33   1.60
sdb               0.00   788.40    0.00  181.40     0.00     3.91    44.12     4.62   25.35   5.46  98.96
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    8.00   1.33   0.32
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.01    5.67   2.33   0.56
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   12.00   6.00   0.48
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.08   10.32   1.05   0.80
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  975.00     0.00     3.81     8.00    20.93   21.39   1.01  98.96
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:01 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.05    0.00    4.34   19.98    0.00   70.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    10.80    0.00    2.80     0.00     0.05    40.00     0.03   10.57   6.86   1.92
sdb               0.00   730.60    0.00  164.80     0.00     3.64    45.20     3.37   20.56   5.47  90.16
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.03   12.31   2.15   0.56
dm-3              0.00     0.00    0.00    2.40     0.00     0.01     8.00     0.02    6.33   3.33   0.80
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01    9.00   5.00   0.40
dm-5              0.00     0.00    0.00    7.60     0.00     0.03     8.00     0.10   13.37   1.16   0.88
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  899.60     0.00     3.51     8.00    16.18   18.03   1.00  90.24
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Time: 05:12:06 AM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.91    0.00    1.36   10.83    0.00   85.89

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     6.40    0.00    3.40     0.00     0.04    25.88     0.04   12.94   5.18   1.76
sdb               0.00   303.40    0.00   88.20     0.00     1.59    36.95     1.83   20.30   5.48  48.32
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    2.60     0.00     0.01     8.00     0.04   14.77   2.46   0.64
dm-3              0.00     0.00    0.00    0.60     0.00     0.00     8.00     0.00   12.00   5.33   0.32
dm-4              0.00     0.00    0.00    0.80     0.00     0.00     8.00     0.01   11.00   5.00   0.40
dm-5              0.00     0.00    0.00    5.80     0.00     0.02     8.00     0.08   12.97   1.66   0.96
hda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-6              0.00     0.00    0.00  393.00     0.00     1.54     8.00     6.46   16.03   1.23  48.32
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device sdb has reached crazy high utilization levels: 98% before dropping back down to 48%. An average queue size of 4.62 in the first run means a lot of I/O requests were queued up waiting on the disk. Write requests (merged) per second of 788 seems respectable. All this while the CPU is 67% idle!
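Rather than eyeballing a whole day of iostat output, the peak utilization for one device can be pulled out with a line of awk. This sketch assumes the extended (-x) layout shown above, with the device name in the first column and %util in the last. The function name peak_util is made up.

```shell
# Sketch only: peak_util is an illustrative name.

# Print the highest %util value logged for one device.
# Assumes iostat -x output: device name first, %util as the last column.
# Usage: peak_util /tmp/iostat sdb
peak_util() {
    awk -v dev="$2" '
        $1 == dev && $NF + 0 > max { max = $NF + 0 }
        END { print max + 0 }
    ' "$1"
}
```

Run against the three samples above, peak_util reports 98.96 for sdb.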

The conclusion: a solid state drive is in order. We are thirsting for I/O far more than for CPU. But solid state drives cost money and have to be justified, which takes time. Can we do something to test our hypothesis and show that faster storage will really alleviate the problem? Yes! SSD is like accessing memory. So let’s build a virtual partition from our memory. tmpfs has made this sinfully easy:

mount -t tmpfs none /mqueue -o size=8192m

We set this to be sendmail’s queue directory. The sendmail mc command looks like this:

define(`QUEUE_DIR',`/mqueue/q*')dnl

which I need to further explain at some point.
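The short version: a queue directory setting that ends in an asterisk tells sendmail to use every directory matching the pattern as a separate queue, so /mqueue/q* spreads the mail over /mqueue/q1, /mqueue/q2 and so on. Those directories have to exist, and since tmpfs starts out empty after every mount they have to be recreated each time. A minimal sketch follows. The function name, the count of four and the 700 permissions are assumptions, so match ownership and mode to whatever your sendmail installation expects.

```shell
# Sketch only: make_queue_dirs, the directory count and mode 700 are
# illustrative choices, not taken from the original setup.

# Create queue directories q1..qN under a base directory.
# Usage (as root, after mounting the tmpfs): make_queue_dirs /mqueue 4
make_queue_dirs() {
    base=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        mkdir -p "$base/q$i"
        chmod 700 "$base/q$i"
        i=$((i + 1))
    done
}
```

Calling this from the same boot script that mounts the tmpfs keeps the two steps together.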

Now it’s interesting that this tmpfs filesystem doesn’t even show up in iostat! That makes sense, since tmpfs is not backed by a block device. I guess its usage all counts as CPU usage.

I now have to send my mail blast to the system with this tmpfs setup. I’m expecting to have essentially converted my lack of I/O into better usage of spare CPU, resulting in a higher-performance system.

The Results
The results are in and they are dramatic. Previous results using a traditional 15K RPM rotating drive:

- disk device became 98% busy
- cpu idle time only dropped as low as 69%
- load average peaked at 37
- SMTP port shut down for some minutes
- 2030 messages accepted in 187 seconds
- 11 messages/second

and now using tmpfs virtual filesystem:

- the load average rose to 3.1 - a much more tolerable result
- the cpu idle time dropped to 32% during the busiest time
- most importantly, the server stayed open for business - for the first time, the SMTP port did not shut down!!
- the 2000 messages were accepted in 34 seconds.  
- that's a record 59 messages/second!
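For anyone checking the arithmetic, the two rates and the speedup between them work out as follows. The function name rate is made up for the example. The inputs are the message counts and times listed above.

```shell
# Sketch only: rate is an illustrative name.

# Messages per second, rounded to the nearest whole number.
# Usage: rate MESSAGES SECONDS
rate() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.0f\n", m / s }'
}

rate 2030 187    # rotating disk
rate 2000 34     # tmpfs

# Speedup of tmpfs over the rotating disk, to one decimal place.
awk 'BEGIN { printf "%.1f\n", (2000 / 34) / (2030 / 187) }'
```

That prints 11, 59 and 5.4, which is where the messages-per-second figures come from.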

Conclusion
Disk I/O was definitely the bottleneck for sendmail. tmpfs rocks! sendmail becomes five times faster using it, and is better behaved. The drawback of this filesystem type is that it is completely volatile and I stand to lose messages if the power ever goes out!

Case Closed!

Categories
IT Operational Excellence Uncategorized

The IT Detective Agency: Debugging a Thorny Citrix Connection Issue

This case begins with an observation by the application owner (AO) for Citrix XenApp. External users were being knocked out of their sessions frequently, several times a day. And it happened en masse. Before this problem, users were typically logged in all day. You can see that many must have been bumped around 12:30 PM, then again around 2 PM. The problems began July 5th.

The users suffering the disconnects were all external users who access the applications via a Citrix Secure Gateway. The XenApp servers being accessed are also used on the Intranet and those users were not seeing any drops.

The AO asked if I had changed anything in the network. Nope. Had he changed anything? Nope.

So now we have the classic stand-off, right? AO vs network. There’s a root cause and it’s either the AO or the network guy who’s ultimately at fault.

My attitude in these cases is the following: the network person should prove it’s an application problem and the application owner should prove it’s a network problem! It sounds cynical, but this approach aligns with the best interests of each party. Both are really working towards the same goal, while preserving their own interests. E.g., the networking person thinks: if I can prove it’s an application problem then the AO will quit bothering me and I can get back to my real job. After all, I am not knowledgeable about the application. Even if it is a networking issue, I do not know where the issue is, so I need the AO to point out the problem at a detailed level, e.g., the dropped packet or whatever, so I can focus my energies. The reality in my experience is quite different, however. The AO typically does not know enough about networking to make this proof.

Nonetheless I proceeded this way, hoping to prove some kind of application problem so I could get back to my normal activities.

We enabled my own PC to use the application. This is always much easier than bothering other people. I can take traces to my heart’s content! So Monday I was connected to XenApp via the CSG. I was going along fine until 11:35 when I got the disconnected message! I later learned that the bulk of users, who are using a different app, were not disconnected then, but were at about an hour later.

Now there are lots of pieces to look at, any one of which could be at fault. Working from the PC on the Internet to the XenApp server we have: the Internet, my Internet router, firewall, load balancer, CSG server, firewall, XenApp server. That’s a lot to look after, but you have to start somewhere. I chose the load balancer. It was rather confusing even to establish a baseline of “normal” activity. I quickly observed that every 30 seconds packets were being transmitted to the PC even when nothing was going on. Of course the communication was all encrypted so I did not even attempt to look into the packets. But sometimes I saw seven packets, sometimes six, and more rarely different numbers. The packet order didn’t even make sense. Sometimes the load balancer responded to the XenApp before the PC did! The trace of this behaviour until I was disconnected will be shown here when I get the time to include it:

The end of the trace shows a bunch of FIN packets. FIN is used to terminate a TCP connection. Now we’re getting somewhere. It looks like, from a TCP perspective, a more-or-less orderly shutdown of the connection was occurring. If confirmed that would point to an application problem and life would be good!

The next day I logged into CSG and used a XenApp app again. This time I did an additional trace and included the CSG server itself. Again I was disconnected after a few hours. In this trace the CSG server is called webservera and the XenApp server is xenapp15. This is not a byte-level trace but rather the output of running snoop on Solaris, showing just the packet metadata:

________________________________
11:29:23.81577 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:23.81577 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4842, TOS=0x0, TTL=126
11:29:23.81577 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:25.01881 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:25.01881 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4844, TOS=0x0, TTL=126
11:29:25.01881 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:27.42530 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:27.42530 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=46, ID=4861, TOS=0x0, TTL=126
11:29:27.42530 xenapp15 -> webservera    TCP D=56011 S=1494 Push Ack=1900995851 Seq=4198089996 Len=6 Win=64134
________________________________
11:29:30.87645    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87645    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=847, TOS=0x0, TTL=64
11:29:30.87645    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:30.87657    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:30.87657    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=848, TOS=0x0, TTL=64
11:29:30.87657    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:33.02325    webservera -> xenapp15 ETHER Type=0800 (IP), size = 54 bytes
11:29:33.02325    webservera -> xenapp15 IP  D=10.201.88.34 S=10.201.142.41 LEN=40, ID=849, TOS=0x0, TTL=64
11:29:33.02325    webservera -> xenapp15 TCP D=1494 S=56011 Ack=4198090002 Seq=1900995851 Len=0 Win=48871
________________________________
11:29:53.34945 xenapp15 -> webservera    ETHER Type=0800 (IP), size = 60 bytes
11:29:53.34945 xenapp15 -> webservera    IP  D=10.201.142.41 S=10.201.88.34 LEN=40, ID=4923, TOS=0x0, TTL=126
11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0

What I saw this time is that an RST packet was being sent from the XenApp server! That’s the very last line, which I will repeat here for emphasis since it is so important to the case:

11:29:53.34945 xenapp15 -> webservera    TCP D=56011 S=1494 Rst Ack=1900995851 Seq=4198090002 Len=0 Win=0

TCP RST is a way to immediately disconnect a connection! It seemed as though this was being converted to a FIN by the CSG. Now it’s looking very much like for whatever reason the application decided to terminate the connection. It almost has to be an application problem, right?
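In a trace that runs for hours, the quickest way to find lines like that one is to filter the saved snoop output for the flag of interest. This is a sketch that assumes the one-line-per-layer format shown above, where "TCP" is the fifth field and the flags follow the port numbers. The function name tcp_flag_lines is made up.

```shell
# Sketch only: tcp_flag_lines is an illustrative name.

# Print only the TCP summary lines that carry a given flag (Rst, Fin, Syn, Push).
# Assumes snoop's layout: time, source, "->", destination, "TCP", then details.
# Usage: tcp_flag_lines snoop.out Rst
tcp_flag_lines() {
    awk -v flag="$2" '
        $5 == "TCP" {
            for (i = 6; i <= NF; i++)
                if ($i == flag) { print; break }
        }
    ' "$1"
}
```

The timestamp on each matching line then tells you where in the full trace to start reading.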

Wrong! We have to keep an open mind.

This trace, while dense, hints at where the problem may lie. It is taken on the load balancer with tcpdump -i 0.0. The load balancer has two interfaces, one towards the Internet, the other towards webserverw. The hostname of the load balancer’s Internet interface is CSG, and the hostname of the Citrix client on the Internet is drjohnspc.

11:03:59.392810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 35 win 32768 (DF)
11:03:59.455730 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 35:68(33) ack 1 win 48677 (DF)
11:03:59.554810 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 68 win 32768 (DF)
11:03:59.585845 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 35 win 64426 (DF)
11:03:59.585855 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768 (DF)
11:03:59.885805 802.1Q vlan#4094 P0 drjohnspc.20723 > CSG.https: . ack 68 win 64393 (DF)
11:04:59.465070 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 68:103(35) ack 1 win 48677 (DF)
11:04:59.465080 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:04:59.564818 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 103 win 32768 (DF)
11:05:00.664812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:02.864812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:07.064810 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:15.264811 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:31.464812 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: P 68:103(35) ack 1 win 32768 (DF)
11:05:59.807514 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: P 103:130(27) ack 1 win 48677 (DF)
11:05:59.807741 802.1Q vlan#4093 P0 webserverw.https > drjohnspc.20723: F 130:130(0) ack 1 win 48677 (DF)
11:05:59.807754 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: . ack 131 win 32768 (DF)
11:05:59.807759 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: FP 103:130(27) ack 1 win 32768 (DF)
11:06:03.664813 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: . 68:103(35) ack 1 win 32768 (DF)
11:06:12.642847 802.1Q vlan#4093 P0 drjohnspc.20723 > webserverw.https: R 1:1(0) ack 131 win 32768 (DF)
11:06:12.642862 802.1Q vlan#4094 P0 CSG.https > drjohnspc.20723: RP 103:130(27) ack 1 win 32768 (DF)

Notice the time stamps increasing by larger and larger leaps beginning with 11:05:00.664812: 11:05:02, 11:05:07, 11:05:15, 11:05:31. The interval between packets keeps doubling! This is characteristic of TCP retransmission with exponential backoff. Note that all the other information is the same. It must be retransmitting the same packet. Why? Because it never got there! That seems to be the most likely reason. Now my conviction and hope that an application problem lies at the heart of the issue is starting to crumble. See why you need to keep an open mind? Your opinion can change to the polar opposite conclusion with the input of some additional data like that. Where to turn next?
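The doubling can be checked with a few lines of Python. This is only a sketch: the timestamps are copied from the vlan#4094 lines of the load balancer trace above, and the function names and the tolerance value are my own choices for illustration.

```python
from datetime import datetime

# Original send plus its retransmits, copied from the vlan#4094 lines
# of the load balancer trace (CSG.https > drjohnspc.20723, 68:103).
SEND_TIMES = [
    "11:04:59.465080",
    "11:05:00.664812",
    "11:05:02.864812",
    "11:05:07.064810",
    "11:05:15.264811",
    "11:05:31.464812",
]


def gaps_in_seconds(stamps):
    """Return the time between consecutive timestamps, in seconds."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def looks_like_backoff(gaps, tolerance=0.25):
    """True if every gap is roughly double the one before it."""
    ratios = [later / earlier for earlier, later in zip(gaps, gaps[1:])]
    return all(abs(r - 2.0) <= tolerance for r in ratios)


gaps = gaps_in_seconds(SEND_TIMES)
print([round(g, 1) for g in gaps])   # [1.2, 2.2, 4.2, 8.2, 16.2]
print(looks_like_backoff(gaps))      # True
```

Each gap is the previous one doubled, give or take a couple tenths of a second, which is the signature of a sender that is getting no acknowledgment.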

There is a firewall in between the load balancer and the Internet. Now we will focus our attention on it. Could be that it dropped that packet and all the re-transmits.

Here’s the trace of that same conversation on the firewall’s internal interface (which faces the CSG) (I(O) means inbound(outbound) with respect to that interface):

11:04:59.441022  I IP CSG.https > drjohnspc.20723: P 2210302227:2210302262(35) ack 1714160833 win 32768
11:05:00.640781  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:02.840742  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:07.040729  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:15.240780  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:31.440571  I IP CSG.https > drjohnspc.20723: P 0:35(35) ack 1 win 32768
11:05:59.783366  I IP CSG.https > drjohnspc.20723: FP 35:62(27) ack 1 win 32768
11:06:03.640595  I IP CSG.https > drjohnspc.20723: . 0:35(35) ack 1 win 32768
^C

and the trace of the same thing on the firewall’s external interface, i.e., facing the Internet and drjohnspc:

11:03:59.269334  O IP CSG.https > drjohnspc.20723: P 2210302159:2210302194(35) ack 1714160833 win 32768
11:03:59.562011  I IP drjohnspc.20723 > CSG.https: . ack 35 win 64426
11:03:59.562139  O IP CSG.https > drjohnspc.20723: P 35:68(33) ack 1 win 32768
11:03:59.861970  I IP drjohnspc.20723 > CSG.https: . ack 68 win 64393

Notice what’s not present in the external interface trace – all those re-transmits, or even the original packet.

Let’s summarize so far. One of those keep-alive packets from the XenApp server reached the firewall, but didn’t exit the firewall. The only possibility is that it got dropped by the firewall!

Now that was a lot of work, but who’s going to do it if not a patient and methodical IT person?

Results
We got a networking problem on our hands after all. Good thing we persisted in this investigation even when it looked like we were off the hook! Later it was confirmed that the firewall was “aggressively aging” its connections because it had either reached or was very close to its connection limit. The firewall connection limit was raised and the Citrix connection issues went away.

Let’s go back to that simplistic question that non-experts like to ask: what had changed that caused this problem? The change was external events – increased usage of that firewall. Network bandwidth, Internet usage – they all tend to increase over time. There were no changes done by either networking or the application group to cause this issue. A seasoned IT detective uses all available clues and arrives at the right conclusion. The “what has changed” question is normally very relevant, but it can’t be the only tool in your toolbox!

Case closed!

Dr John’s Laws of IT

Laws

Here are some laws of IT based on years of observation.  Some of these have very real and practical consequences.

1. IT infrastructure decays over time if left to itself – a sort of entropy sets in.  This is sort of counterintuitive insofar as people know enough about troubleshooting a problem to ask “what changed.”  Sometimes the answer is nothing at all, or nothing you would ever think of.  For example, I once had an application server start to fail when “nothing had changed.”  The cause, found after many hairs pulled out?  The log file it wrote to tried to exceed 2 GB on a 32-bit system.  It couldn’t write to its log any longer and the app server just froze up.

1.1 Corollary to 1.  Neglect works great in the short term, but the way to go is judicious maintenance!  Neglect leads to a 2 GB log in the first place!

2. Things will always go wrong at some point.  It will usually not be for the reason you suspect.

2.1 Corollary to 2.  Effective monitoring is critical.  If you build something critical, build a means to monitor it.  Monitor foundational components as well so that when you need it, you can see what all was working when one thing went south.

3. Software support from large vendors is abysmal.  Most small and mid-sized vendors are no better.  The premise of almost all support I’ve encountered is The customer did something wrong.  The most relevant metric is How quickly can the case be closed?  If in that rare case the customer can prove fault by the vendor, Justify doing Absolutely Nothing about it for as Long as Possible.  And NEVER do something immediately useful like let the customer speak to a software developer who actually knows what he/she is talking about.

3.1 Corollary to 3. An IT professional quickly develops all the skills possessed by the front-line engineers who respond to support calls, and can solve most of the problems on his/her own, out of necessity, since the assistance given won’t take the problem further anyways.

Observations

Laws are universal.  The following are key observations that are generally true.

1. The more an IT person thinks about a problem, the better the solution.  Better means cheaper, faster, more elegant, even moving the category from impossible to the possible (and actually this happens frequently believe it or not).

2. Even a brilliant IT professional won’t think up all solutions alone.  Creative problem solving occurs best when there are a couple of brilliant IT professionals bouncing ideas off each other, with a few others at the ready to contribute specialist opinions.

3. How to estimate the amount of time for an IT project:

    2 x (estimate from experienced IT professional) + constant

If many groups and external partners are involved, the multiplier should be increased to 3 or even 4.

This sounds facetious but it is not.  It is the unfortunate truth of the nature of our work and the unpredictability of the showstopper moments which always occur.

4. All a seasoned IT person needs to decide the impossible is possible is to hear that someone else is doing it!  The creative juices start flowing at that point.  Maybe it’s a competitive thing at that point.

5. Large IT organizations contain a large number of people who actually know surprisingly little about IT.  Small IT organizations are also not immune from this.
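The rule of thumb in observation 3 can be written down as a tiny function. This is just a sketch: the function name is mine, and the post never says what the constant is, so the default of zero and the five days used below are placeholders rather than recommendations.

```python
def project_estimate(expert_estimate_days, multiplier=2.0, constant_days=0.0):
    """Apply observation 3: multiplier x (expert estimate) + constant.

    multiplier    -- 2 by default; raise it to 3 or even 4 when many
                     groups and external partners are involved
    constant_days -- the fixed overhead; unspecified in the post, so
                     it is a placeholder the caller has to supply
    """
    return multiplier * expert_estimate_days + constant_days


# An expert says 10 days; assume 5 days of fixed overhead.
print(project_estimate(10, constant_days=5))                # 25.0
print(project_estimate(10, multiplier=3, constant_days=5))  # 35.0
```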

Virtual Server not Working in F5 BigIP

OK. This posting is only directly applicable to the small number of people who run BigIP load balancers. And of that set, only a certain subset will likely ever have this situation. Nevertheless, it’s useful to document it. There are lessons in it for the rest of us; it shows the creative problem-solving process used in IT, or rather the creative process that should be used.

So I had a virtual server associated with a certain pool and it was operating fine for years. Then something changes. We want to associate a new name with this virtual server, but test it first, all while keeping the old name working. Well, this is a secured site, by which I mean it is running https rather than http. There’s nothing intrinsic in the web site itself that ties it to a particular name. If this were a run-of-the-mill non-secure site you would solve this problem with DNS. Set up an alias and you’re good to go. But secured sites are a wee bit trickier. They present a certificate after all. And the certificate has just one name, at least ours does, and that name is the original DNS name. (Guess I can address multi-name certificates, known as Subject Alternative Name CERTs, in a separate post.) What to do? Simple. As any BigIP admin would tell you, you create a new virtual server and associate it with a new IP and a new SSL profile containing the new certificate you just bought, but with the old pool. In DNS assign this new IP to your new DNS name. That’s all pretty straightforward.

Having done all that, I blithely tested with lynx (it’s an old curses-based browser which runs on old Unix systems. The main point is to not test with a complex browser like Internet Explorer, where you are never 100% sure if the problem lies with the browser. If I had it, I would test with curl, but it’s not on that system.). And…it hangs.

Now I’ll admit a lot of stupid things I did (which is typical of any good debugging session of an IT professional – some self-created red herrings accompany any decent sleuthing) and I ratchet up the debugging a notch. Check the web server logs. I see no log of my lynx accesses. Dig a little deeper still. Fire up a trace. Here’s a little time-saver. BigIP does have a tcpdump program, but it is a little stunted. Typically you have multiple interfaces on a BigIP. In this case I felt it pertinent to know if packets were getting to the BigIP from lynx, and then again, if those packets were leaving the BigIP and going to the web server. So the tip is that whereas a “normal” tcpdump might allow you to use the switch -i any to listen on all interfaces, that doesn’t work on BigIP. Use -i 0.0 instead. And of course restrict it somehow so that your own shell session’s packets won’t be picked up by the trace, or else you could be in for a nasty surprise of exponentially increasing traffic (a devastating situation perhaps worthy of its own blog entry!). In this case I added an expression port 443. So I have:

tcpdump -i 0.0 port 443

And, somewhat to my surprise (You should always have a hypothesis, even if it’s just a gut feeling: will this little test work, or not. Why?) not only were packets going from lynx to BigIP and then again to the web server, I could even see returned packets back from the web server to BigIP to lynx. But it was not a lot of packets. A SYN, SYN-ACK and maybe a single data packet and that’s about it. It should have been more chatty.

The more tests you can think of, the better, especially ones that emphasize the marginal differences between the thing that works and the one that doesn’t. One test along those lines: take this same virtual server and associate it with a different pool. I did that, and that test worked!

Next, I tried to access the web server using curl on the BigIP itself. I could, but not at first. First I used the local web server URL http://web_server_ip:443/. It hung my curl command, just like using lynx on the other server. Hmm. I then looked on the web server again. I notice that it has a certificate installed. Ah. So it’s actually running https. So try curl from BigIP again, but this time with the -k switch (insecure, meaning don’t verify the certificate issuer) and a url beginning with https rather than http. Bingo. It comes back with the home page. Now we’re getting somewhere.

Finally I look more closely at the virtual server setup for the old name, the one that works. I see that the server profile is SSL. It basically means that the traffic is encrypted when it hits the BigIP, and the server CERT is associated with the external name. The BigIP decrypts the traffic, then re-encrypts it before sending it along to the web server. The CERT for the second leg is a self-signed CERT and is never seen by users.

I had forgotten to set up my new test virtual server with the server SSL profile, so the second leg of traffic was not being re-encrypted by the BigIP, even though the web server was only willing to engage in SSL communication with the BigIP. Once I updated the server profile, it all worked fine! Of course after getting the expected results from lynx I went to my desktop browser, just like a regular user, and successfully tested it there as well. You want to make sure your final tests are a realistic approximation of what the user will be doing. If that’s not all possible under your own control, bring in a user for testing.
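If curl isn’t handy, the curl -k test against the web server can be approximated with Python’s standard library. Treat this as a rough sketch and not as what I ran at the time: the function names are mine, and web_server_ip is the same placeholder used in the URL above.

```python
import http.client
import ssl


def unverified_context():
    """Build a TLS context that skips certificate checks, like curl -k."""
    context = ssl.create_default_context()
    context.check_hostname = False       # has to be turned off first
    context.verify_mode = ssl.CERT_NONE  # accept self-signed certificates
    return context


def fetch_status(host, port=443, path="/", timeout=5.0):
    """Return the HTTP status of an HTTPS GET, ignoring certificate errors."""
    conn = http.client.HTTPSConnection(
        host, port, timeout=timeout, context=unverified_context()
    )
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

Calling fetch_status("web_server_ip") should come back with a status code if the server really speaks https on that port. The timeout matters: a plain-http request to an https port tends to hang, exactly as lynx and curl did for me.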

Liked this article? Here’s another of my IT operational excellence articles that has a somewhat wider applicability.

Grep is Slow as a Snail in SLES 11 – Solved

I had written earlier about the performance problems of Suse Linux Enterprise Server v 11  Service Pack 1 (SLES 11 SP1)  under VMWare: http://drjohnstechtalk.com/blog/2011/06/performance-degradation-with-sles-11-sp1-under-vmware/.  What I hadn’t fully appreciated at that time is that part of the problem could be with the command grep itself.  Further investigation has convinced me that grep as implemented under SLES 11 SP 1 X86_64 is horrible.  It is seriously broken. The following results are invariant under both a VM and a physical server.

Methodology 1

A cksum shows that grep has changed between SLES 10 SP 3 and SLES 11 SP 1.  I’m not sure what the changes are.  So I performed an strace while grep’ing a short file to see if there are any extra system calls which occur under SLES 11 SP 1.  There are not.

I copied the grep binary from SLES 10 SP 3 to a SLES 11 SP 1 system.  I was afraid this wouldn’t work because it might rely on dynamic libraries which also could have changed.  However this appears to not be the case and the grep binary from the SLES 10 system is about 17 times faster, running on the same SLES 11 system!

Methodology 2

I figure that I am a completely amateur programmer.  If with all my limitations I can implement a search utility that does considerably better than the shell command grep, I can fairly decisively conclude that grep is broken.  Recall that we already have comparisons that show that grep under SLES 10 SP 3 is many times faster than under SLES 11 SP 1.

Results

The table summarizes the findings. All tests were on a 109 MB file which has 460,000 lines.

OS              Type of Grep                Time (s)
SLES 11 SP 1    built-in                    42.6
SLES 11 SP 1    SLES 10 SP 3 grep binary    2.5
SLES 11 SP 1    Perl grep                   1.1
SLES 10 SP 3    built-in                    1.2
SLES 10 SP 3    Perl grep                   0.35

The Code for Perl Grep

Hey, I don’t know about you, but I only use a fraction of the features in grep. The switches i and v cover about 99% of what I do with it. Well, come to think of it I do use alternate expressions in egrep (w/ the “|” character), and the C switch (provides context by including surrounding lines) can sometimes be really helpful. The l (filenames only) and n (include line numbers) look useful on paper, but you almost never end up needing them. Anyways I simply didn’t program those things to keep it simple. Maybe later. To make it as fast as possible I avoided anything I thought the interpreter might trip over, at the expense of repeating code snippets multiple times. At some point (allowing another switch or two) my approach would be ludicrous as there would be too many combinations to consider. But at least in my testing it does function just like grep, only, as you see from the table above, it is much faster than grep. If I had written it in a compiled language like C it should go even faster still. Perl is an interpreted language so there should always be a performance penalty in using it. The advantage is of course that it is so darn easy to write useful code.

#!/usr/bin/perl
# J.Hilgart, 6/2011
# model grep implementation in Perl
# feel free to borrow or use this, but it will not be supported
use Getopt::Std;
$DEBUG = 0;
# Get the command line options.
getopts('iv');
# the search string has to be present
$mstr = shift @ARGV;
usage() unless defined $mstr && length $mstr;
# treat dots in the search string as literal dots
$mstr =~ s/\./\\./g;
# the remaining arguments are the files to be searched
$nofiles = @ARGV;
print "nofiles: $nofiles\n" if $DEBUG;

# call subroutine based on arguments present (each one exits when done)
optiv() if $opt_i && $opt_v;
opti()  if $opt_i;
optv()  if $opt_v;
normal();
################################
sub normal {
foreach $file (@ARGV) {
  open(FILE, "<", $file) || die "Cannot open $file!!\n";
# print filename if there is more than one file being searched
  $filePrefix = $nofiles > 1 ? "$file:" : "";
  while(<FILE>) {
    print "$filePrefix$_" if /$mstr/;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print if /$mstr/;
}
}
exit;
} # end sub normal
###############################
sub opti {
foreach $file (@ARGV) {
  open(FILE, "<", $file) || die "Cannot open $file!!\n";
  $filePrefix = $nofiles > 1 ? "$file:" : "";
  while(<FILE>) {
    print "$filePrefix$_" if /$mstr/i;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print if /$mstr/i;
}
}
exit;
} # end sub opti
#################################
sub optv {
foreach $file (@ARGV) {
  open(FILE, "<", $file) || die "Cannot open $file!!\n";
  $filePrefix = $nofiles > 1 ? "$file:" : "";
  while(<FILE>) {
    print "$filePrefix$_" unless /$mstr/;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print unless /$mstr/;
}
}
exit;
} # end sub optv
##############################
sub optiv {
foreach $file (@ARGV) {
  open(FILE, "<", $file) || die "Cannot open $file!!\n";
  $filePrefix = $nofiles > 1 ? "$file:" : "";
  while(<FILE>) {
    print "$filePrefix$_" unless /$mstr/i;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print unless /$mstr/i;
}
}
exit;
} # end sub optiv
sub usage {
# print a short reminder and quit
print STDERR "usage: $0 [-i] [-v] pattern [file ...]\n";
exit 2;
}

Conclusion
So built-in grep performs horribly on SLES 11 SP 1, about 17 times slower than the SLES 10 SP 3 grep. I wonder what an examination of the source code would reveal? But who has time for that? So I’ve shown a way to avoid it entirely, by using a perl grep instead – modify to suit your needs. It’s considerably faster than what the system provides, which is really sad since it’s an amateur, two-hour effort compared to the decade+ (?) of professional development on Posix grep. What has me more concerned is what haven’t I found, yet, that also performs horribly under SLES 11 SP 1? It’s like deer on the side of the road in New Jersey – where there’s one there’s likely to be more lurking nearby : ) .

Follow Up
We will probably open a support case with Novell. I am not very optimistic about our prospects. This will not be an easy problem for them to resolve – the code may be contributed, for instance. So, this is where it gets interesting. Is the much-vaunted rapid bug-fixing of open source really going to make a substantial difference? I would have to look to OpenSUSE to find out (where I suppose the fixed code would first be released), which I may do. I am skeptical this will be fixed this year. With luck, in a year’s time.

7/15 Update
There is a newer version of grep available. Old version: grep-2.5.2-90.18.41; new version: grep-2.6.3-90.18.41. Did it fix the problem? Depends how low you want to lower the bar. It’s a lot better, yes. But it’s still three times slower than grep from SLES 10 SP3. So…still a long ways to go.

9/7 Update – The Solution
Novell came through today, three months later. I guess that’s better than I pessimistically predicted, but hardly anything to brag about.

Turns out that things get dramatically better if you simply define the environment variable LC_ALL=POSIX. They do expect a better fix with SLES 11 SP 2, but there’s no release date for that yet. Being a curious sort, I revisited SLES 10 SP3 with this environment variable defined and it considerably improved performance there as well! This variable has to do with the locale and language support. Here’s a table with some recent results. Unfortunately the SLES 11 SP 1 system is a VM and the SLES 10 SP3 system is a physical server, although the same file was used. So the thing to concentrate on is the improvement in performance of grep with vs without LC_ALL defined.

OS              LC_ALL=POSIX defined?    Time (s)
SLES 11 SP 1    no                       6.9
SLES 11 SP 1    yes                      0.36
SLES 10 SP 3    no                       0.35
SLES 10 SP 3    yes                      0.19

So if you use SLES 10/11, make sure you have a

export LC_ALL=POSIX

defined somewhere in your profile if you plan to use grep very often. It makes a 19x performance improvement in SLES 11 and almost a 2x performance improvement under SLES 10 SP3.
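If you want to repeat the comparison on your own system, something along these lines should do. It is a sketch, not the harness I used: the function names are mine, grep is assumed to be on the PATH, and a single run per setting is a crude measurement since the second run benefits from the file already being cached.

```python
import os
import subprocess
import time


def time_command(argv, extra_env=None):
    """Run argv once and return the elapsed wall-clock time in seconds."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    subprocess.run(argv, env=env, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def compare_locales(pattern, path):
    """Time grep with the inherited locale, then with LC_ALL=POSIX."""
    default = time_command(["grep", pattern, path])
    posix = time_command(["grep", pattern, path], {"LC_ALL": "POSIX"})
    return default, posix


def speedup(before_s, after_s):
    """How many times faster the second measurement is."""
    return before_s / after_s


# Speedups implied by the times in the table above.
print(round(speedup(6.9, 0.36), 1))   # SLES 11 SP 1
print(round(speedup(0.35, 0.19), 1))  # SLES 10 SP 3
```

Run compare_locales a few times on a large file and compare the two numbers; the first pass mostly warms the file cache.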

Related
If you like the idea of grep but want a friendlier interface, I was thinking I ought to mention Splunk. A Google search will lead you to it. It started with a noble concept – all the features of grep, plus a convenient web interface so you never have to get your hands dirty and actually log into a Linux/unix system. It was like a grep on steroids. But then in my opinion they ruined a simple utility and blew it up with so many features that it’ll take hours to just scratch the surface of its capabilities. And I’m not even sure a free version is still available. Still, it might be worth a look in some cases. In my case it also slowed down searches, though supposedly it should have sped them up.

And to save for last what should have come first, grep is a search utility that’s great for looking at unstructured (not in a relational database) data.

Swapping Servers while Preserving IPs – What Might Go Wrong

The Setup

I had this experience last week. I needed to swap a virtual server in place of a physical server. I had all the access I needed on both servers to do the necessary network changes, which is how I customarily do these things. I implement network configuration changes as opposed to, say, plugging cables in and pulling others out.

The Issue

Anyways, this sounded straightforward enough.  The physical server had a backup interface on a different segment – one that I could access from the backup interface of another server also on that backup segment (so that I wouldn’t disconnect myself as I was shutting down the primary interface).  So as I was saying, simple: shut down the primary interface on the physical server, configure the VM’s interface like that of the physical server, reboot the virtual server so the interface changes take effect and can be seen to be working even after a reboot.  But it didn’t work, or more precisely, it half-worked.  Why?

A Trail of Hints

Here’s what I haven’t yet said, and it turns out to play a significant role, though I did not know it at the time.  See, that interface had two IPs defined, both on the same segment: a primary on eth0 and a virtual one on eth0:ns2v, which I’ll call secondary since virtual is a loaded term.  After the switch eth0 was working OK, but eth0:ns2v was not!  I also need to mention how they are used, from a network perspective, to see if you are following the hints and can guess what the problem might be before I spell it out.  I have different DNS servers bound to these interfaces.  They are resolving DNS servers.  It actually does not matter (another hint!) but the OS is SLES 11.

Final hint: eth0 probably took a few minutes to work, eth0:ns2v was not working even after 17 minutes.  By not working I mean that I could see the interface on the VM come up OK, my DNS server was bound to it and I could send it queries from the VM itself.  But queries from servers on other segments to this secondary were not being returned.  I tried a trace on the VM: tcpdump -i eth0:ns2v  (OK. I lied.  More hints on the way.  This is how you solve such problems!), while doing a PING from a server on another segment. Nothing coming in.  PINGs and DNS queries to primary interface were coming in fine however.  So I know I had my routing correct.

Biggest hint of all: I could PING this secondary interface from another server on the same segment!

So what the heck is going on here?  And it’s late at night of course so no one is disturbed by this change.  I begin to suspect the router.  After all, everything else is good, right?

Do I bother the network guy to fix his router?  For me that’s akin to admitting failure to plan.  So, no, I don’t want to.  That secondary interface isn’t that important.  It could wait until morning.  But it nags at me…

First Inkling

Then it hits me.  The Aha moment.  Let me back up.  Like I said I become convinced that the router is simply wrong.  But it’s one device I do not have any administrative access to.  What do I mean by “wrong” from a network engineer’s point-of-view?  I became convinced that its ARP table hadn’t aged out its entry for the secondary IP as it ought to have.  All hosts maintain an ARP table which stores the correspondence between IP address and MAC address of other devices on the same segment.  It’s how a device “knows” to talk to the right device when an application specifies an IP address – by correctly converting it into a MAC address.  So, you see, I preserved IPs.  But what if  the router held onto the old MAC address for the secondary IP?  It would try to send traffic destined for that IP which came in from other segments to the wrong place, or no place at all, since the old MAC was now offline.  I’m not exactly sure what happens to those packets.  I’d have to investigate and think about it some more.  Could be they get sent out via the switch but dropped by the switch as there’s no place to deliver them.

But the one IP, the main one, was working.  If you can’t solve what’s wrong, it’s a good idea to review what’s gone right amongst the things which are closely related.  And try to understand the difference in the two cases.

Aha Moment

That was the real Aha moment.  A server is always doing a bit of communication.  This and that chatter.  I realized the router was seeing some of that, and that it was all coming from the main IP.  Why? Because that’s just how things work in IPv4.  Usually.  So it made some sense that the router would update the ARP entry of the main IP.  After all it was seeing these packets come to it which contained the new MAC address/old IP address.  So it probably “knew” to update its ARP table with the new MAC from those packets.  But it wasn’t getting any packets that contained the new MAC address/old secondary IP address combination!  Knowing this situation, you would hope, as a reasonable person, that there would be an ARP table timer on all the ARP entries and they would simply age out and be renewed from time-to-time to prevent just such a situation.  You would hope, but it wasn’t happening.  17 minutes is a long time.  I later learned that this was an old router.  Supposedly it has an ARP timer of five or ten minutes.  But I know that isn’t correct. 
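Here is a toy model of what I believe was going on in the router. It is a sketch of my hypothesis, not of how any particular router works: the class, the four-hour timer, and the addresses (taken from the ranges reserved for documentation) are all made up for illustration.

```python
class ArpCache:
    """Toy IP-to-MAC table with an aging timer (not a real network stack)."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        """Record or refresh an entry, as when traffic from ip is seen."""
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now):
        """Return the cached MAC, or None if missing or aged out."""
        entry = self.entries.get(ip)
        if entry is None or now - entry[1] > self.max_age:
            return None
        return entry[0]


OLD_MAC = "00:00:5e:00:53:01"   # the physical server (hypothetical)
NEW_MAC = "00:00:5e:00:53:02"   # the replacement VM (hypothetical)
PRIMARY, SECONDARY = "192.0.2.10", "192.0.2.11"

router = ArpCache(max_age_seconds=4 * 3600)  # assumed: a very long timer
router.learn(PRIMARY, OLD_MAC, now=0)
router.learn(SECONDARY, OLD_MAC, now=0)

# After the swap, routine chatter reaches the router from the primary IP only.
router.learn(PRIMARY, NEW_MAC, now=60)

seventeen_minutes = 17 * 60
print(router.lookup(PRIMARY, now=seventeen_minutes))    # the new MAC
print(router.lookup(SECONDARY, now=seventeen_minutes))  # still the old MAC

# Traffic sourced from the secondary IP refreshes the stale entry.
router.learn(SECONDARY, NEW_MAC, now=seventeen_minutes)
print(router.lookup(SECONDARY, now=seventeen_minutes))  # the new MAC
```

With a short enough timer the stale entry would simply have returned None and been re-learned, which is what you would hope for.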

But I did not know any of that at the time.  I knew the main interface worked, the secondary didn’t.   Packets were streaming out of the primary to the router, no packets were streaming from the secondary to the router.  So I asked myself: how can I send packets from the secondary interface??  How do you do that?

I suspected two ways offhand.  I’m sure there are lots of others.  I suspected PING could do it.  Check the man pages.  Yup. ping -I interface_address.  Bingo.  I decided to ping the router, which is, of course, my default gateway, with the secondary IP as source.  The packets were returned.  Good.  Then I noticed that my monitors were completing.  I checked seconds later.  Sure enough, I could now reach that secondary IP from other segments.  Yeah!  Problem resolved.

Mystery solved, and no cold call to the networking group required.

Tying Up the Loose Ends

What would I have asked for if I had called the networking group?  I would have told them I suspect the ARP table on their router was not updating and could they please delete the ARP entry for that secondary IP, that’s what.  That’s what I would have done right away myself if I had had that kind of access.  On *ix devices there is usually a command like arp -d ip_address to delete a specific ARP entry. 

This also explains why another device on that segment could see that secondary IP while at the same time the router couldn’t.  The other device obviously had a more well-behaved ARP time-out mechanism.  Or perhaps it didn’t but it had had no ARP entry for that secondary IP until after the server switch?  And of course the way modern switches work the traffic is all directed and carved up.  So the communication between those two devices, which would have contained nice and up-to-date MAC/IP entries, was completely segregated and none of it would have been seen by the router, so in that sense wasn’t helping any.

And what was the other way to send packets from a specific IP?  dig.  I use dig constantly in my capacity as a DNS admin, so I was aware it also allows you to specify your source IP address (dig -b).  Another way that most people would have thought of?  nmap.  I haven’t really checked, but I’m willing to bet nmap could easily also have been used.  But that’s kind of a “nasty” utility and actually isn’t normally available on self-respecting servers.  It certainly wasn’t on this one.  sendmail MTA could also be used for this same purpose (setting the source IP), but that’s a pain in the rear to set up.  As I say there are probably lots of other utilities that do this.  nc or netcat, depending on your version of Linux, may also be promising.  The aspiring programmers could write a simple Perl (or pick your language) client to do the same thing, etc.   I now see that even telnet allows you to specify your source IP with the -b switch.  So it seems to be a fairly common feature – though not universal, just try to find it on an FTP client – in most networking utilities.
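For the aspiring programmers, here is roughly what such a client could look like in Python. It is a minimal sketch and not what I used that night (that was ping -I): the function name is mine, and whether a stray UDP packet would be enough to make a given router refresh its ARP entry is an assumption I haven’t tested.

```python
import socket


def send_from(source_ip, dest_ip, dest_port, payload=b"hello"):
    """Send one UDP datagram whose source address is source_ip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Binding before sending is what pins the source address.
        # Port 0 lets the OS pick any free source port.
        sock.bind((source_ip, 0))
        sock.sendto(payload, (dest_ip, dest_port))
        return sock.getsockname()
```

You would call it with the secondary IP as source_ip and the default gateway as dest_ip. The bind call is the whole trick, and it is the same thing dig -b and telnet -b do under the hood.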

An IT person benefits from having lots of tools which accomplish the same things in different ways.

More Details As Time Permits