Categories
Admin IT Operational Excellence Web Site Technologies

Virtual Server not Working in F5 BigIP

OK. This posting is only directly applicable to the small number of people who run BigIP load balancers. And of that set, only a certaIn subset will likely ever have this situation. Nevertheless, it’s useful to document it. There are lessons in it for the rest of us, it shows the creative problem-solving process used in IT, or rather the creative process that should be used.

So I had a virtual server associated with a certain pool and it was operating fine for years. Then something changes. We want to associate a new name with this virtual server, but test it first, all while keeping the old name working. Well, this is a secured site by which I mean it is running https rather than http. There’s nothing intrinsic in the web site itself that ties it to a particular name. If this were a run-of-the-mill non-secure site you would solve this problem with DNS. Set up an alias and you’re good to go. But secured sites are a wee bit trickier. They present a certificate after all. And the certificate has just one name, at least ours does. Guess I can address multi-name certificates known as Subject Alternative Name CERTs in a separate post. And that name is the original DNS name. What to do? Simple. As any BigIP admin would tell you you create a new virtual server and associate it with a new IP and a new SSL profile containing the new certificate you just bought but the old pool. In DNS assign this new IP to your new DNS name. That’s all pretty straightforward

Having done all that, I blithely tested with lynx (iI’s an old curses-based browser which runs on old Unix systems. The main point is to not test with a complex browser where like Internet Explorer where you are never 100% sure if the problem lies with the browser. If I had it, I would test with curl, but it’s not on that system.). And…it hangs.

Now I’ll admit a lot of stupid things I did (which is typical of any good debugging session of an IT professional – some self-created red herrings accompany any decent sleuthing) and I ratchet up the debugging a notch. Check the web server logs. I see no log of my lynx accesses. Dig a little deeper still. Fire up a trace. Here’s a little time-saver. BigIP does have a tcpdump program, but it is a little stunted. Typically you have multiple interfaces on a BigIP. In this case I felt it pertinent to know if packets were getting to the BigIP from lynx, and then again, if those packets were leaving the BigIP and going to the web server. So the tip is that whereas a “normal” tcpdump might allow you to use the switch -i any to listen on all interfaces, that doesn’t work on BigIP. Use -i 0.0 instead. And of course restrict it somehow so that your own shell session’s packets won’t be picked up by the trace, or else you could be in for a nasty surprise of exponentially increasing traffic (a devastating situation perhaps worthy of its own blog entry!). In this case I added an expression port 443. So I have:

tcpdump -i 0.0 port 443

And, somewhat to my surprise (You should always have a hypothesis, even if it’s just a gut feeling: will this little test work, or not. Why?) not only were packets going from lynx to BigIp and then again to the web server, I could even see returned packets back from the web server to BigIp to lynx. But it was not a lot of packets. A SYN, SYN-ACK and maybe a single data packet and that’s about it. It should have been more chatty.

The more tests you can think of, the better, especially ones that emphasize the marginal differences between the thing that works and the one that doesn’t. One test along those lines: take this same virtual server and associate it with a different pool. I did that, and that test worked!

Next, I tried to access the web server using curl on the BigIP itself. I could, but not at first. First I used the local web server URL http://web_server_ip:443/. It hung my curl command, just like using lynx on the other server. Hmm. I then looked on the web server again. I notice that it has a certificate installed. Ah. So it’s actually running https. So try curl from BigIP again, but this time with the -k switch (insecure, meaning don’t verify the certificate issuer) and a url beginning with https rather than http. Bingo. It comes back with the home page. Now we’re getting somewhere.

Finally I look more closely at the virtual server setup for the old name, the one that works. I see that the server profile is SSL. It basically means that the traffic is encrypted when it hits the BigIP, and the server CERT is associated with the external name. The BigIP decrypts the traffic, then re-encrypts it before sending it along to the web server. The CERT for the second leg is a self-signed CERT and is never seen by users.

I had forgotten to set up my new test virtual server with the server SSL profile, so the second leg of traffic was not being re-encyrpted by the BigIP, even though the web server was only willing to engage in SSL communication with the BigIP. Once I updated the server profile, it all worked fine! Of course after getting the expected results from lynx I went to my desktop browser, just like a regular user, and successfully tested it there as well. You want to make sure your final tests are a realistic approximation of what the user will be doing. If that’s not all possible under your own control, bring in a user for testing.

Liked this article? Here’s another of my IT operational excellence articles that has a somewhat wider applicability.

Categories
IT Operational Excellence Linux

Grep is Slow as a Snail in SLES 11 – Solved

I had written earlier about the performance problems of Suse Linux Enterprise Server v 11  Service Pack 1 (SLES 11 SP1)  under VMWare: http://drjohnstechtalk.com/blog/2011/06/performance-degradation-with-sles-11-sp1-under-vmware/.  What I hadn’t fully appreciated at that time is that part of the problem could be with the command grep itself.  Further investigation has convinced me that grep as implemented under SLES 11 SP 1 X86_64 is horrible.  It is seriously broken. The following results are invariant under both a VM and a physical server.

Methodology 1

A cksum shows that grep has changed between SLES 10 SP 3 and SLES 11 SP 1.  I’m not sure what the changes are.  So I performed an strace while grep’ing a short file to see if there are any extra system calls which occur under SLES 11 SP 1.  There are not.

I copied the grep binary from SLES 10 SP 3 to a SLES 11 SP 1 system.  I was afraid this wouldn’t work because it might rely on dynamic libraries which also could have changed.  However this appears to not be the case and the grep binary from the SLES 10 system is about 19 times faster, running on the same SLES 11 system!

Methodology 2

I figure that I am a completely amateur programmer.  If with all my limitations I can implement a search utility that does considerably better than the shell command grep, I can fairly decisively conclude that grep is broken.  Recall that we already have comparisons that show that grep under SLES 10 SP 3 is many times faster than under SLES 11 SP 1.

Results

The table summarizes the findings. All tests were on a 109 MB file which has 460,000 lines.

OS

Type of Grep

Time (s)

SLES 11 SP 1

built-in

42.6

SLES 11 SP 1

SLES 10 SP 3 grep binary

2.5

SLES 11 SP 1

Perl grep

1.1

SLES 10 SP 3

built-in

1.2

SLES 10 SP 3

Perl grep

0.35 s

The Code for Perl Grep

Hey, I don’t know about you, but I only use a fraction of the features in grep. The switches i and v cover about 99% of what I do with it. Well, come to think of it I do use alternate expressions in egrep (w/ the “|” character), and the C switch (provides context by including surrounding lines) can sometimes be really helpful. The i (filenames only) and n (include line numbers) look useful on paper, but you almost never end up needing them. Anyways I simply didn’t program those things to keep it simple. Maybe later. To make it as fast as possible I avoided anything I thought the interpreter might trip over, at the expense of repeating code snippets multiple times. At some point (allowing another switch or two) my approach would be ludicrous as there would be too many combinations to consider. But at least in my testing it does function just like grep, only, as you see from the table above, it is much faster than grep. If I had written it in a compiled language like C it should go even faster still. Perl is an interpreted language so there should always be a performance penalty in using it. The advantage is of course that it is so darn easy to write useful code.

#!/usr/bin/perl
# J.Hilgart, 6/2011
# model grep implementation in Perl
# feel free to borrow or use this, but it will not be supported
use Getopt::Std;
$DEBUG = 0;
# Get the command line options.
getopts('iv');
# the search string has to be present
$mstr = shift @ARGV;
usage() unless $mstr;
$mstr =~ s/\./\\./g;
# the remaining arguments are the files to be searched
$nofiles = @ARGV;
print "nofiles: $nofiles\n" if $DEBUG;
$filePrefix = $nofiles > 1 ? "$_:" : "";
 
# call subroutine based on arguments present
optiv() if $opt_i && $opt_v;
opti()  if $opt_i;
optv()  if $opt_v;
normal();
################################
sub normal {
foreach (@ARGV) {
  open(FILE,"$_") || die "Cannot open $_!!\n";
  while(<FILE>) {
# print filename if there is more than one file being searched
    print "$filePrefix$_" if /$mstr/;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print if /$mstr/;
}
}
exit;
} # end sub normal
###############################
sub opti {
foreach (@ARGV) {
  open(FILE,"$_") || die "Cannot open $_!!\n";
  while(<FILE>) {
    print "$filePrefix$_" if /$mstr/i;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print if /$mstr/i;
}
}
exit;
} # end sub opti
#################################
sub optv {
foreach (@ARGV) {
  open(FILE,"$_") || die "Cannot open $_!!\n";
  while(<FILE>) {
    print "$filePrefix$_" unless /$mstr/;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print unless /$mstr/;
}
}
exit;
} # end sub optv
##############################
sub optiv {
foreach (@ARGV) {
  open(FILE,"$_") || die "Cannot open $_!!\n";
  while(<FILE>) {
    print "$filePrefix$_" unless /$mstr/i;
  }
  close(FILE);
}
if (! $nofiles) {
# no files specified, use STDIN
while(<STDIN>) {
  print unless /$mstr/i;
}
}
exit;
} # end sub optiv
sub usage {
# I never did finish this...
}

Conclusion
So built-in grep performs horribly on SLES 11 SP 1, about 17 times slower than the SLES 10 SP 3 grep. I wonder what an examination of the source code would reveal? But who has time for that? So I’ve shown a way to avoid it entirely, by using a perl grep instead – modify to suit your needs. It’s considerably faster than what the system provides, which is really sad since it’s an amateur, two-hour effort compared to the decade+ (?) of professional development on Posix grep. What has me more concerned is what haven’t I found, yet, that also performs horribly under SLES 11 SP 1? It’s like deer on the side of the road in New Jersey – where there’s one there’s likely to be more lurking nearby : ) .

Follow Up
We will probably open a support case with Novell. I am not very optimistic about our prospects. This will not be an easy problem for them to resolve – the code may be contributed, for instance. So, this is where it gets interesting. Is the much-vaunted rapid bug-fixing of open source really going to make a substantial difference? I would have to look to OpenSUSE to find out (where I suppose the fixed code would first be released), which I may do. I am skeptical this will be fixed this year. With luck, in a year’s time.

7/15 Update
There is a newer version of grep available. Old version: grep-2.5.2-90.18.41; New version: grep-2.6.3-90.18.41 Did it fix the problem? Depends how low you want to lower the bar. It’s a lot better, yes. But it’s still three times slower than grep from SLES 10 SP3. So…still a long ways to go.

9/7 Update – The Solution
Novell came through today, three months later. I guess that’s better than I pessimistically predicted, but hardly anything to brag about.

Turns out that things get dramatically better if you simple define the environment variable LC_ALL=POSIX. They do expect a better fix with SLES 11 SP 2, but there’s no release date for that yet. Being a curious sort, I revisited SLES 10 SP3 with this environment variable defined and it also considerably improved performance there as well! This variable has to do with the Locale and language support. Here’s a table with some recent results. Unfortunately the SLES 11 SP 1 is a VM, and SLES 10 SP3 is a physical server, although the same file was used. So the thing to concentrate on is the improvement in performance of grep with vs without LC_ALL defined.

OS

LC_ALL=POSIX defined?

Time (s)

SLES 11 SP 1

no

6.9

SLES 11 SP 1

yes

0.36

SLES 10 SP 3

no

0.35

SLES 10 SP 3

yes

0.19 s

So if you use SLES 10/11, make sure you have a

export LC_ALL=POSIX

defined somewhere in your profile if you plan to use grep very often. It makes a 19x performance improvement in SLES 11 and almost a 2x performance improvement under SLES 10 SP3.

Related
If you like the idea of grep but want a friendlier interface, I was thinking I ought to mention Splunk. A Google search will lead you to it. It started with a noble concept – all the features of grep, plus a convenient web interface so you never have to get yuor hands dirty and actually log into a Linux/unix system. It was like a grep on steroids. But then in my opinion they ruined a simple utility and blew it up with so many features that it’ll take hours to just scratch the surface of its capabilities. And I’m not even sure a free version is still available. Still, it might be worth a look in some cases. In my case it also slowed down searching though supposedly it should have sped them up.

And to save for last what should have come first, grep is a search utility that’s great for looking at unstructured (not in a relational database) data.

Categories
Linux

Gnu Parallel Really Helps With Zcat

I happened across what I think will turn out to be a very useful open source application: Gnu parallel.  Let me explain how I got to that point.  I was uncompressing gzip’d files, of which I have lots.  So many, in fact, it can take the better part of a day to go through them all.  The server I was using is four years old but I don’t have access to anything better.  In fact it seems processor speed has sort of plateaud.  Moore’s Law only applies if you count the number of processors, which are multiplying like rabbits.  So I start to think that maybe my single-threaded approach is wrong, see? I need to scale horizontally, not vetically.

It just happens that in my former life I spent quite some effort into parallelizing a compute-intensive program I wrote.  I can’t believe that it’s still hanging around on the Internet!  I haven’t looked for it for at least 17 years: my FERMISV  Monte Carlo.  So I thought, what if I could parallelize my zcat|grep?  Writing such a thing from scratch would be too time-consuming, but after searching for parallelize zcat I quickly found Gnu Parallel.  It’s one thing to identify such a program in a Google search, another to get it working.  Believe me, I’ve put in my time with some of these WordPress plugins.

The upshot is that, yes, Gnu Parallel actually works and it’s not hard to learn to use.  They have a nice Youtube video.  For me I employed syntax like:  

ls *gz|time parallel -k "zcat {}|grep 192.168.23.34"  > /tmp/matches

.  This, of course, is to be compared with the normal operation I have been doing for eons:

time zcat *gz|grep 192.168.23.34 > /tmp/matches

.  The “time” is just for benchmarking and normally wouldn’t be present. 

The results are that the matches file produced in the two different ways are identical.  That’s good!  The -k switch ensured the order is preserved in the output, and anyways, it didn’t cost me any time. Otherwise the order is not guaranteed. I tested on a server with four cores.  Gnu Parallel reported that it used 375% of my CPU!  It finished in less than half the time of my normal way of doing things.  I need to put it through more real-world exercises, but it really looks promising.  I never got onto the xargs bandwagon (perhaps I should be ashamed to admit it?), but Gnu Parallel can do a lot of that same type of thing, but in my opinion is more intuitive.  Take a look!

Categories
Linux

Performance Degradation With SLES 11 SP1 Under VMWare

Some friends and I found severe performance degradation using SLES 11 SP1 under VMWare.  That’s Suse Linux Enterprise Server Service Pack 1.  I’m still trying to get my head around it – there are lots of variables to control for. Later update For my analysis of the specific problem with grep on SLES 11 SP1, please go here: http://drjohnstechtalk.com/blog/2011/06/grep-is-slow-as-a-snail-in-sles-11/.

The Method

I start with a 20 Mb gzip’d file.  Let’s call it file.gz.  Uncompressed it is 107 MB.  I run

time zcat file.gz > /dev/null

I run it several times and report the lowest number which I am consistently able to reproduce. That feels the fairest way to me to benchmark in this case.

On a fast, 3 GHz physical server running SLES 10 SP3 it takes 0.63 s. Let’s call that server Physical.

Then I add in something to make it useful: grep’ing for an IP address:

time zcat file.gz|grep 192.168.23.34 > /dev/null

On Physical that takes 0.65 s.

My Amazon EC2 image – let’s call it Amazon-EC2 – runs ubuntu 10.10 on a 2.6 GHz VM. To learn CPU speed in ubuntu:

cat /proc/cpuinfo|grep MHz

My SLES 11 SP1 is a guest VM on VMWare ESX. It has a 2.4 GHz processor. Let’s call it SLES11SP1. For CPU speed in SLES*:

dmesg|grep -i cpu

and look for the line that says Intel … GHz.

* 7/1/2011 Correction The correct way to get the processor speed is

grep -i /proc/cpuinfo

The cpu Mhz line shows the running speed. this also seems to work in RHEL (Redhat) and Debian distributions such as Ubuntu. I’ve looked at several models. Usually the model name – which is what you get from the dmesg way – lists a speed that is the same as the cpu MHz given the 1000x difference between GHz and MHz, but not always! I have come across a recently purchased server with a surprisingly slow clock speed, and one that is quite different from the CPU name:

model name      : Intel(R) Xeon(R) CPU           E7520  @ 1.87GHz
cpu MHz         : 1064.000

Who knew you could even buy a server-class CPU that slow these days?

For comparison I have a SLES 10 SP3 VM under VMWare. It also has a 2.4 GHz CPU. SLES10SP3. All servers are X86_64.

The Results
The amazing results:

Server zcat time zcat|grep IP time
Physical 0.63 s 0.65 s
Amazon-EC2 0.73 s 1.06 s
SLES11SP1 0.99 s 5.8 s
SLES10SP3 0.78 s 0.93 s

Analysis
I’ve tried many more variants than displayed in that table, but you get the idea. All VMs are slower than all physical servers tested in all tests. Most discouragingly, the SLES11 SP3 is five or six times slower than a comparable physical server for the real-life test involving grep. I used the same file in all tests.

Conclusions
Virtualization is not a panacea. It has its role, but also its limitations. And either something is wrong with SLES 11 SP1, which I doubt, or something is wrong with the way I set it up, despite the fact that I’ve tried a few variants. We’re going to try it on a physical server next. I’ll update this post if anything noteworthy happens.

Update
I got it tested on a physical server now, too. A HP G6 w/ 4 Gb RAM. SLES 11 SP1. The CPU identified itself as Xeon E5504 @ 2.0 GHz. Here are the awful timings:

Server zcat time zcat|grep IP time
SLES 11 SP1 Physical 0.90 s 4.8 s

That shows that SLES 11 SP1 itself is causing my performance degradation. Not every instance of SLES 11 is faulty, however. It turns out that IBM’s Watson is running it! http://whatis.techtarget.com/definition/ibm-watson-supercomputer.html

For the record, the kernel in SLES 11 SP1 is 2.6.32.12-0.7, while in SLES 10 SP3 it is 2.6.16.60-0.68.1. On a RHEL 5.6 VM (I did not bother to list its results), where the kernel was 2.6.18-238.1.1, the degradation was not nearly so bad either. I wonder if the kernel version is having a major impact? Perhaps not. The kernel in ubuntu 10.10 looks pretty new – 2.6.35-24. I am running

uname -a

to see the kernel version.

Further Update
We also got to test with SLES 10 SP4 VM. You see from the table that it is well-behaved. It had 2 CPUs which identified themselves as X6550 @ 2.0 GHz. 4 GB RAM. Kernel 2.6.16.60-0.85.1

Server zcat time zcat|grep IP time
SLES 10 SP4 VM 0.96 s 1.0 s
Categories
Web Site Technologies

Changing The Font in WordPress TwentyTen Theme

I still think WordPress is a mess!  But it’s better than what I cuold do on my own, so I’m sticking with it for now.  The simplest things quickly devolve into an exercise for a jaded IT veteran.  Don’t even get me started on replacing the header image while using an Ubuntu server which conveniently does not come with php5-gd.

Here is something I actually manmaged to do without hours of effort.  Replacing the font style in my posts.  Everyone knows that sans-serif fonts are more readable, look more professional and are more compact.  Not sure why the twenyten theme does not use it for the body of the posts.  What I found, without being a CSS master, is that if you change line 118 of style.css to

        font-family: Arial, Helvetica, sans-serif;

It works to change the font of all your posts.

Before the change that line reads like this:

That was it for me!  Not bad, eh?

For a more sophisticated treatment, you should consider doing all your customizations in a child theme. It can’t be hard because I just managed to do it! It looks to be a more elegant and robust approach, which appeals to me, because you leave all the files in the parent theme alone, and only override the files which actually need to be changed. Read about it in http://codex.wordpress.org/Child_Themes.

Categories
Web Site Technologies

WordPress Templates are a Nightmare

Like a typical opensource effort, WordPress is a mixed bag. It’s wonderful for scripters like me to get so much access to the source. But the documentation and concepts are opaque, and this is coming from a seasoned IT veteran. Could they have possibly made it more complicated?

I hope to help you cut through the inscrutable explanations on such pages as http://codex.wordpress.org/Template_Tags and cut to the chase for knowing how to change what matters, whether or not, like me, you really understand what the heck they’re talking about.

Say you want to modify something in the appearance of your posts.  That’s what I wanted.  Once I learned the “easy way” to install plugins (see http://drjohnstechtalk.com/blog/2011/06/security-considerations-for-wordpress-plugins-and-upgrades/), I wanted to get a plugin to count the millions of expected visitors, ha, ha!

Now more comfortable with plugins, I actually installed WP-PostViews using my SmartPhone.  Cool, right?  Except that I found installation is one thing, configuring it to actually do something another.  Fortunately, I do have a PhD in a technical field, so I refused to be daunted.

I wanted to display the view count above or below each post.  From the exceedingly poor documentation available on WP-PostViews, I gathered that I needed to insert this php code:

<?php if(function_exists('the_views')) { the_views(); } ?>

into one or more of my template files to display the view count.  The (incorrect) PostViews documentation said just put it into index.php, inside the section

<?php while ( have_posts() ) : the_post(); ?>

Great.  That simply doesn’t exist in my index.php in my theme (twentyten).

So now we’re looking at all these files in that directory, wp-content/themes/twentyten, to figure out which may be the right one:

404.php         comments.php   loop-attachment.php  page.php            tag.php
archive.php     footer.php     loop-page.php        search.php
attachment.php  functions.php  loop.php             sidebar-footer.php
author.php      header.php     loop-single.php      sidebar.php
category.php    index.php      onecolumn-page.php   single.php

As I promised I’ll cut through all the bluster about themes, templates, hierarchies and other WordPress nonsense.  My degree is in experimental physics.  I experiment.  By experimentation and some tiny understanding of their concepts I can now say you need to change these two files:

loop.php
loop-single.php

That’s it.  I just saved you three hours of useless research.

Update
I see they try to make it easier for you by allowing you to edit the template files from within the admin GUI, including some function documentation. It still leaves a lot to be desired.

To be continued…

Categories
Web Site Technologies

Security Considerations for WordPress Plugins and Upgrades

The following comments apply to WordPress v 3.1.3 and may not apply to earlier versions, with which I have no familiarity.

WordPress has an interesting idea for doing upgrades and downloading plugins. It took some getting used to until I learned to embrace it. I needed to understand the security considerations. Now I have a much better handle on it and feel comfortable with it.

First thing after installing WordPress, Murphy’s law you know, I was presented with an important security upgrade the very next day. I did the upgrade the hard way, doing all the file manipulation by hand. Copying files here and there, etc. I run the web server with a different user than the owner of the HTML documents to make things more secure. So I naively figured there was no way WordPress’s offer of automatically updating my installation would be possible in my case. After all all it could do was to run with the permissions of the web server, which as I say doesn’t have permissions to write to the relevant parts of the filesystem, right?

Then I learned that my colleagues on the Newton Robotics Team were managing to do it under the same conditions, so it piqued my curiosity. The next plugin I wished to install, WP-Syntax, offered me the same possibility of automatically installing it from the WordPress Admin GUI. It suggested that all I needed was to enter FTP credentials or use FTP/SSL. It did not explain how those credentials were going to be used, and I feared that they would be shared with another site.  Let’s think about this (this is how an IT person thinks).  There are two main possibilties. 1) The FTP client is initiated from an external site, probably where the repository where the plugin is housed, e.g., wordpress.org.  It was my gut feeling that was the case.  2) that the FTP client is on my local server where I run WordPress.  But, huh, what’s the point of that?

Turns out that 2) is what’s happening.  But then what is the point and how does it work?  By reverse engineering and reasoning, it must work as follows.  WordPress must download the plugin from the distribution site, perhaps through HTTP or FTP.  Perhaps it uses the FTP proxy feature where an intermediate can have an FTP connection to twp FTP servers and transfer files between them.  To expand it and put it into the local WordPress plugins directory, where the web server doesn’t have permissions to write, it definitely has to use FTP, but you gave it the credentials of the account that does have permissions to write to the plugins directory!  Clever, huh?  Of course this presupposes something.  Maybe if I read the WordPress requirements I would see that running an FTP server is strongly recommended. But I didn’t so this is another lesson learned through the school of hard knocks!  You see,  Ubuntu server and I think most linux distributions do not even bother to give you an FTP server.  Without a local FTP server WordPress cannot pull off its trick.  I’m not sure why they cannot use sftp, which is pretty universal these days.  In Ubuntu, you have the FTP client, but not the server.

I tried to run ftpd on my server to see what I would get.  It was missing and several packages which provide it were mentioned.  I chose inetutils-ftpd:  sudo apt-get install inetutils-ftpd.  I quickly learn that it relies on inetd, which I see I am not even running.  But it also has the option to run as a daemon: ftpd -D, which I chose to do (it won’t start after reboot without more jiggering, but I can start it by hand as I don’t need it often).

But how do I test my new FTP server?  Will it really work when WordPress tries to use it?  

Feb 2012 Update
I am now comfortable with directing WordPress to do my upgrade. I got tired of it bugging me about the 3.3.1 release so I relented and upgraded to it. I learned how to backup my database first, which is when I saw it was dominated by all the spam and scams I have been receiving. So I went back to the dashboard, got rid of 600 spam comments and re-ran the database mysqldump. The database dump file reduced in size from 10 MB to 3 MB! So it was 70% spam. Great people out there, huh? But I digress. I temporarily enabled my FTP daemon as described above and all went fine.

Then I enabled simple captcha challenge for POSTers. For now simple math seems to be flummoxing the auto-scam submitters! Next day my instance died. No idea why…

Categories
IT Operational Excellence Network Technologies

Swapping Servers while Preserving IPs – What Might Go Wrong

The Setup

I had this experience last week. I needed to swap a virtual server in place of a physical server. I had all the access I needed on both servers to do the necessary network changes, which is how I customarily do these things. I implement network configuration changes as opposed to, say, plugging cables in and pulling others out.

The Issue

Anyways, this sounded straightforward enough.  The physical server had  a backup interface on a different segment – one that I could access from the backup interface of another server also on that backup segment (so that I wouldn’t disconnect myself as I was shutting down the primary interface).  So as I was saying, simple: shutdown the primary interface on the physical server, configure the VM’s intereface similar to the physical server, reboot the virtual server so the interface changes take effect and can be seen to be working even after a reboot.  But it didn’t work, or more precisely, it half-worked.  Why?

A Trail of Hints

Here’s what I didn’t yet say that turns out that has a significant role though I did not know it at the time.  See, that interface had two IPs defined, a primary and a virtual, I’ll call it secondary since virtual is a loaded term, IP, both on the same segment, i.e., eth0 and eth0:ns2v.  After the switch eth0 was working OK, but eth0:ns2v was not!  I also need to mention how they are used, from a network perspective, to see if you are following the hints and can guess what the problem might be before I spell it out.  I have different DNS servers bound to these interfaces.  They are resolving DNS servers.  It actually does not matter (another hint!) but the OS is SLES 11.

Final hint: eth0 probably took a few minutes to work, eth0:ns2v was not working even after 17 minutes.  By not working I mean that I could see the interface on the VM come up OK, my DNS server was bound to it and I could send it queries from the VM itself.  But queries from servers on other segments to this secondary were not being returned.  I tried a trace on the VM: tcpdump -i eth0:ns2v  (OK. I lied.  More hints on the way.  This is how you solve such problems!), while doing a PING from a server on another segment. Nothing coming in.  PINGs and DNS queries to primary interface were coming in fine however.  So I know I had my routing correct.

Biggest hint of all: I could PING this secondary interface from another server on the same segment!

So what the heck is going on here?  And it’s late at night of course so no one is disturbed by this change.  I begin to suspect the router.  After all, everything else is good, right?

Do I bother the network guy to fix his router?  For me that’s akin to admitting failture to plan.  So, no, I don’t want to.  That secondary interface isn’t that important.  it could wait until morning.  But it nags at me…

First Inkling

Then it hits me.  The Aha moment.  Let me back up.  Like I said I become convinced that the router is simply wrong.  But it’s one device I do not have any administrative access to.  What do I mean by “wrong” from a network engineer’s point-of-view?  I became convinced that its ARP table hadn’t aged out its entry for the secondary IP as it ought to have.  All hosts maintain an ARP table which stores the correspondence between IP address and MAC address of other devices on the same segment.  It’s how a device “knows” to talk to the right device when an application specifies an IP address – by correctly converting it into a MAC address.  So, you see, I preserved IPs.  But what if  the router held onto the old MAC address for the secondary IP?  It would try to send traffic destined for that IP which came in from other segments to the wrong place, or no place at all, since the old MAC was now offline.  I’m not exactly sure what happens to those packets.  I’d have to investigate and think about it some more.  Could be they get sent out via the switch but dropped by the switch as there’s no place to deliver them.

But the one IP, the main one, was working.  If you can’t solve what’s wrong, it’s a good idea to review what’s gone right amongst the things which are closely related.  And try to understand the difference in the two cases.

Aha Moment

That was the real Aha moment.  A server is always doing a bit of communication.  This and that chatter.  I realized the router was seeing some of that, and that it was all coming from the main IP.  Why? Because that’s just how things work in IPv4.  Usually.  So it made some sense that the router would update the ARP entry of the main IP.  After all it was seeing these packets come to it which contained the new MAC address/old IP address.  So it probably “knew” to update its ARP table with the new MAC from those packets.  But it wasn’t getting any packets that contained the new MAC address/old secondary IP address combination!  Knowing this situation, you would hope, as a reasonable person, that there would be an ARP table timer on all the ARP entries and they would simply age out and be renewed from time-to-time to prevent just such a situation.  You would hope, but it wasn’t happening.  17 minutes is a long time.  I later learned that this was an old router.  Supposedly it has an ARP timer of five or ten minutes.  But I know that isn’t correct. 

But I did not know any of that at the time.  I knew the main interface worked, the secondary didn’t.   Packets were streaming out of the primary to the router, no packets were streaming from the secondary to the router.  So I asked myself: how can I send packets from the secondary interface??  How do you do that?

I suspected two ways offhand.  I’m sure there are lots of others.  I suspected PING could do it.  Check the man pages.  Yup. ping -I interface_address.  Bingo.  I decided to ping the router, which is, of course, my default gateway, with the secondary IP as source.  The packets were returned.  Good.  Then I noticed that my monitors were completing.  I checked seconds later.  Sure enough, I could now reach that secondary IP from other segments.  Yeah!  Problem resolved.

Mystery solved, and no cold call to the networking group required.

Tying Up the Loose Ends

What would I have asked for if I had called the networking group?  I would have told them I suspect the ARP table on their router was not updating and could they please delete the ARP entry for that secondary IP, that’s what.  That’s what I would have done right away myself if I had had that kind of access.  On *ix devices there is usually a command like arp -d ip_address to delete a specific ARP entry. 

This also explains why another device on that segment could see that secondary IP while at the same time the router couldn’t.  The other device obviously had a more well-behaved ARP time-out mechanism.  Or perhaps it  didn’t but it had had no ARP entry for that secondary IP until after the server switch?  And of course the way modern switches work the traffic is all directed and carved up.  So the communication between those two devices, which would have contained nice and uptodate MAC/IP entries was completely segregated and none of it would have been seen by the router, so in that sense wasn’t helping any.  And what was the other way to send packets from a specific IP?  dig.  I use dig constantly in my capacity as a DNS admin, so I was aware it also allows you to specify your source IP address (dig -b).  Another way that most people would have thought of?  nmap.  I haven’t really checked, but I’m willing to bet nmap could easily also have been used.  But that’s kinf of a “nasty” utility and actually isn’t normally available on self-respecting servers.  It certainly wasn’t on this one.  sendmail MTA could also be used for this same purpose (setting the source IP), but that’s a pain in the rear to set up.  As I say there are probably lots of other utilities that do this.  nc or netcat, depending on your version of Linux, may also be promising.  The aspiring programmers could write a simple PERL (or pick your language) client to do the same thing, etc.   I now see that even telnet allows you to specify your source IP with the -b switch.  So it seems to be a fairly common feature – though not universal, just try to find it on an FTP client – in most networking utilities.

An IT person benefits from having lots of tools which accomplish the same things in different ways.

More Details As Time Permits