Categories
Admin Network Technologies

DIY Home Power Monitoring Solution – Perfect for Sandy

Intro
My recent experience losing power thanks to Sandy has gotten me thinking. How can I know when my power’s on? Or what if it gets shut off again, which by the way actually happened to me? I realized that I have all the pieces in place already and merely need to take advantage of the infrastructure already out there.

The details
I started with:

  • a working smartphone
  • an enterprise monitoring system
  • a SoHo router in my home

And that’s pretty much it!

The smartphone doesn’t really matter, as long as it receives emails. I guess a plain old cellphone would work just as well – the messages can be sent as text messages.

For enterprise monitoring I like HP SiteScope because it’s more economical than hard-core systems. I wrote a little about it in a previous post. Nagios is also commonly used and it’s open source, meaning free. Avoid Zabbix at all costs. Editor’s note: OK. I’ve changed my mind about Zabbix some seven years later. It’s still confusing as heck, but now I’m using it and I have to admit it is powerful. See this write-up.

A good SoHo router is Juniper SSG5. It extends the enterprise LAN into the home. You can carve up a Juniper router like that and provide a Home network, Work network, Home wireless and Work wireless. It’s great!

The last requirement is the key. The SoHo router at my house is always on, and so the enterprise LAN is always available, as long as I have power. Get it? I defined a simple PING monitor in SiteScope to ping my SoHo router’s WAN interface, which has a static private IP address on the enterprise LAN. If I can’t ping it, I’ve lost power, and use my monitoring system to send an email alert to my smartphone. When, or in the case of Sandy’s interminable outage, if, power ever comes back on, I send another alert letting me know that as well. If you’re not using SiteScope make sure you send several PINGs. A PING can be lost here or there for various reasons. siteScope sends out five PINGs at a regular interval, as a guideline.

Alternative
Of course if I had a business-class DSL or cable modem service with a dedicated IP I could have just PINGed that, but I don’t. With regaulr consumer grade service your IP can and will change from time to time, and using a dynamic DNS protocol (like dyndns) to mask that problem is a bit tricky.

Yes, if you call the power company they offer to call you back when your power is restored, but I like my monitor better. It tells me when things go off as well as on.

Conclusion
This is my necessity-is-the-mother-of-invention moment. Thank you, Sandy!

Categories
Linux Network Technologies SLES TCP/IP

Ethernet Bridging on the cheap. Fail. Then Success with OLTV

Intro
Some experiments just don’t work out. I became curious about a technology that has various names: ethernet bridging, wide-area VLANs, OTV, L2TP, etc. It looked like it could be done on the cheap, but that didn’t pan out for me. But later on we got hold of high-end gear that implements OTV and began to get it to work.

The details
What this is is the ability to extend a subnet to a remote location. How cool is that? This can be very useful for various reasons. A disaster recovery center, for instance, which uses the same IP addressing. A strategic decision to move some, but not all equipment on a particular LAN to another location, or just for the fun of it.

As with anything truly useful there is an open source implementation(s). I found openvpn, but decided against it because it had an overall client/server description and so didn’t seem quite what I had in mind. Openvpn does have a page about creating an ethernet bridging setup which is quite helpful, but when you install the product it is all about the client/server paradigm, which is really not what I had in mind for my application.

Then I learned about Astaro RED at the Amazon Cloud conference I attended. That’s RED as in Remote Ethernet Device. That sounded pretty good, but it didn’t seem quite what we were after. It must have looked good to Sophos as well because as I was studying it, Sophos bought them! Asataro RED is more for extending an ethernet to remote branch offices.

More promising for cheapo experimentation, or so I thought at the time, is etherip.

Very long story short, I never got that to work out in my environment, which was SLES VM servers.

What seems to be the most promising solution, and the most expensive, is overlay transport virtualization (OLTV or simply OTV), offered by Cisco in their Nexus switches. I’ll amend this post when I get a chance to see if it worked or not!

December Update
OTV is beginning to work. It’s really cool seeing it for the first time. For instance, I have a server in South Carolina on an OTV subnet, IP 10.94.45.2. Its default gateway is in New Jersey! Its gateway is in the ARP table, as it has to be, but merely to PING the gateway produces this unusual time lag:

> ping 10.94.45.1

PING 10.94.45.1 (10.194.54.33) 56(84) bytes of data.
64 bytes from 10.94.45.1: icmp_seq=1 ttl=255 time=29.0 ms
64 bytes from 10.94.45.1: icmp_seq=2 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=3 ttl=255 time=29.6 ms
64 bytes from 10.94.45.1: icmp_seq=4 ttl=255 time=29.1 ms
64 bytes from 10.94.45.1: icmp_seq=5 ttl=255 time=29.4 ms

See those response times? Huge. I ping the same gateway from a different LAN but same server room in New Jersey and get this more typical result:

# ping 10.94.45.1

Type escape sequence to abort.
Sending 5, 64-byte ICMP Echos to 10.94.45.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/0/1 ms
Number of duplicate packets received = 0

But we quickly stumbled upon a gotcha. Large packets were killing us. The thing is that it’s one thing to run OTV over dark fiber, which we know another customer is doing without issues; but to run it in an MPLS network is something else.

Before making any adjustment on our servers we found behaviour like the following:
– initial ssh to linux server works OK; but session soon freezes after a directory listing or executing other commands
– pings with the -s parameter set to anything greater than 1430 bytes failed – they didn’t get returned

So this issue is very closely related to a problem we observed on a regular segment where getvpn had just been implemented. That problem, which manifested itself as occasional IE errors, is described in some detail here.

Currently we don’t see our carrier being able to accommodate larger packets so we began to see what we could alter on our servers. On Checkpoint IPSO you can lower the MTU as follows:

> dbset interface:eth1c0:ipmtu 1430

The change happens immediately. But that’s not a good idea and we eventually abandoned that approach.

On SLES Linux I did it like this:

> ifconfig eth1 mtu 1430

In this platform, too, the change takes place right away.

By the we experimented and found that the largest MTU value we could use was 1430. At this point I’m not sure how to make this change permanent, but a little research should show how to do it.

After changing this setting, our ssh sessions worked great, though now we can’t send pings larger than 1402 bytes.

The latest problem is that on our OTV segment we can ping only one device but not the other.

August 2013 update
Well, we are resourceful people so yes we got it running. Once the dust settled OTV worked pretty well, with certain concessions. We had to be able to control the MTU on at least one side of the connection, which, fortunately we always could. Load balancers, proxy servers, Linux servers, we ended up jiggering all of them to lower their MTU to 1420. For firewall management we ended up lowering the MTU on the centralized management station.

Firewalls needed further voodoo. After pushing policy clamping needs to be turned back on and acceleration off like this (for Checkpoint firewalls):

$ fw ctl set int fw_clamp_tcp_mss 1
$ fwaccel off

Conclusion
Having preserved IPs during a server move can be a great benefit and OTV permits it. But you’d better have a talented staff to overcome the hurdles that will accompany this advanced technology.

Categories
Admin Network Technologies

The IT Detecive Agency: FTPs and other things going slow to one part of one location

Intro
What happens when you have a slowly festering problem – slow but tolerable performance? Is it a problem when no one complains but you know the numbers don’t feel right? Read on to see what happens in this exciting adventure.

The details
I always felt that one of my locations, let’s call it Scattering, was just hard-to-explain slow. Web site accesses as measured in HP SiteScope was slower than at another site, lets call it Cooper, and the variation seemed larger. You could always chalk it up to the fact that the SiteScope server was in the data center with the good performance, but there seemed more to it than that.

The situation was, apparently, tolerable and people grew accustomed to it. Until one day someone said that it just wasn’t right. Ping times from his site, let’s call it Vanderbilt, looked like this from his desktop:

C:\Users>ping dmz-host
 
Pinging dmz-host [10.93.23.12] with 32 bytes of data:
Reply from 10.93.23.12: bytes=32 time=80ms TTL=248
Reply from 10.93.23.12: bytes=32 time=107ms TTL=248
Reply from 10.93.23.12: bytes=32 time=74ms TTL=248
Reply from 10.93.23.12: bytes=32 time=78ms TTL=248

And yet, ping times to another piece of equipment in that same physical server room looked much healthier – on order of 22 msec with very little jitter. What the…?

So I started looking around at my equipment. My equipment generally had good response times amongst itself, but when it crossed over to another group’s equipment it started to go up. Finally I saw it – a common denominator. My equipment had its gateway supplied by a different group. PINGing the local gateway gave 15 msec response times, with high jitter! I’ve never seen that before. You normally expect to get response times , 1 msec. Time to turn the problem over to them.

But as a former scientist, I like to have a hypothesis about what might be wrong. I thought a bad cable, because I’ve heard of that. Duplex mis-matches on the port are always a strong possibility. Guess what? It wasn’t any of those things. But it was something about the port on their equipment. What do you suppose it was? Hint: the equipment they were using was leftover stuff that needed to be replaced but hadn’t because it just kept working.

Mystery Resolved
The port was saturated! It was older equipment with 100 mbit ports, the amount of traffic slowly built up on it over time and sort of crept up on us all. I guess finally the demand on it was causing noticeable problems in response times that someone finally complained!

He shifted the gateway to a 1 gbit port and things got much better. Here are those PINGs now:

C:\Users>ping 10.93.23.12
 
Pinging 10.93.23.12 with 32 bytes of data:
Reply from 10.93.23.12: bytes=32 time=24ms TTL=248
Reply from 10.93.23.12: bytes=32 time=23ms TTL=248
Reply from 10.93.23.12: bytes=32 time=23ms TTL=248
Reply from 10.93.23.12: bytes=32 time=23ms TTL=248
 
Ping statistics for 10.93.23.12:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 23ms, Maximum = 24ms, Average = 23ms

That’s much better!

The HP SiteScope timing are more subtle, but no less important. Users don’t tolerate slow-loading web pages, you know!

I need to make an aside here. Is there no free and widely distributed Linux package to calculate the standard deviation of a set of numbers? I can’t believe it. I had to borrow this guy’s shell script, and it’s pretty lousy because it’s really slow: http://panoskrt.wordpress.com/2009/03/10/shell-script-for-standard-deviation-arithmetic-mean-and-median/. Anyways, doing a before and after analysis on my URL timing, where I fetched a Google page plus images during the daytime from 10 AM to 4 PM, I find this before and after result:

Before

Number of data points in “sca” = 144
Arithmetic mean (average) = .630548611
Standard Deviation = .225365467
Median number (middle) = .664500000

After

Number of data points in “sca2” = 144
Arithmetic mean (average) = .349270833
Standard Deviation = .162399076
Median number (middle) = .263000000

Users were very happy after the upgrade.

Conclusion
An old gateway port that maxed out its traffic capacity was to blame for performance problems in one part of a data center. This doesn’t happen too often, but it can happen, obviously.

Linux needs a simple math package to allow to calculate the standard deviation of a set of numbers.

Categories
DNS IT Operational Excellence Network Technologies

The IT Detective Agency: Roku player can’t connect to Internet

Intro
I bought a used Roku player, just to try it out. Things didn’t go right at first.

The details
Setting it up, I got to the point where it tests its Internet connection. There are three status lights. The first two were OK, but the Internet connection returned an error, I think error #9. It didn’t matter whether I used a wired or wireless connection.

I have an unusual Internet router at home, a Juniper SSG5. All my other devices off that router were getting to Internet just dandy however. Long story short, it turns out that there was a slight DNS misconfiguration. The Juniper had itself as secondary DNS server. There was no primary DNS server. Why this only affected Roku is still a mystery.

I updated the Juniper config to use 4.2.2.3 as primary DNS server and the Roku connected just fine and it worked wirelessly as well.

Case closed!

Conclusion
Well, as a network type-of-guy, I would have appreciated access to the Roku Linux OS so I could have seen for myself what was going on. I’m sure I would have figured out the problem much sooner than I did. The error message was not particularly helpful and most of the online discussions attribute that problem to other causes than was the DNS issue discovered here. Short of OS access, more robust debugging tools, such as PING, would have been might handy.

Categories
Admin Network Technologies

The IT Detective Agency: Two of our sites got cut off!

Intro
I sometimes consult for the networking group of a large company. This incident really happened. I don’t know that it could ever happen again to anyone else, but it’s so bizarre that I just had to document it as an example of “you wouldn’t believe it unless you had actually been through it yourself.”

Let’s get into it
This company has lots of small and mid-sized offices connected via MPLS. WAN services are provided by a single telecom throughout the country. I feel obliged to not divulge specifics here. Let’s call the telecom “OE” as in over-extended.

So just before lunch yesterday they tell me that no PCs at one of their sites can access Internet, and this information is coming from a very reliable source. It also comes out that a second site is similarly affected. It kind of sounds like a WAN problem, but no other sites are affected. In the old days you’d almost certainly know to look at the WAN, but these days it’s a little more complicated. Everyone’s PC is in AD and they have the ability to push a GPO to all PCs at a site, so you just never know if the desktop group wasn’t involved in messing them up.

So they tell me they can PING their corporate Intranet server. Fine. But they cannot telnet to port 80. Newsflash. How did they get telnet enabled in Windows 7? I mentally stored this question for my continuing education. Crises are also great learning/teaching moments if you are of that frame-of-mind!

Ping is good. Of course I test the Intranet server myself, iwww.intranet. I can reach port 80 just fine. It happens that the front-end for iwww.intranet is a load balancer. I decide to do a trace using tcpdump. I’m not sure what I’ll find, but taking a trace is sort of a gut reaction in these cases. There’s lots of other traffic so we have to use a filter to see the tree in the forest. They give me the IP of the PC they’re testing from. My expression is something like this:

> tcpdump -i 0.0 host WKSTATION.AD.INTRANE

The 0.0 on this particular device is its way of saying use any and all interfaces.

Here’s what the output looks like:

11:51:03.852511 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de0a5), length 100
11:51:05.855446 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de101), length 100
11:51:06.187940 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de10b), length 100
11:51:08.857957 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de16d), length 100
11:51:09.184072 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de179), length 100
11:51:09.858865 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de198), length 100
11:51:14.855327 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de272), length 100
11:51:15.183349 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3de27f), length 100
11:52:19.898380 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3decab), length 100
11:52:22.901684 IP WKSTATION.AD.INTRANET > iwww.intranet: ESP(spi=0xd7ca145d,seq=0x3ded28), length 100

I had never seen that before. I loop up ESP and see it is related to IP protocol 50 which in turn is used in IPSEC for VPN connections.

What the…?

Yeah. Let’s review. He’s sending TCP packets to port 80 of iwww.intranet and all I’m seeing are these ESP packets, which isn’t even TCP. Of course the load balancer has no idea whatsoever what to do with those packets and simply does not respond to them.

What would you do? I felt the nail in the coffin would be to take a trace on the PC itself – see how the packets are when they’re coming straight out of the PC. But to be honest they never did make that trace. They didn’t have Wireshark installed so it would take awhile.

Meanwhile, the infrastructure folks are talking to each other and someone mentions the OE has a certain project “runVPN” that thay’re rolling out. Now that sounds suspicous. In the imperfect world you have to work with what you have, not what you’d like to have. Based on our experiences and educated hunches, we now feel pretty certain it’s gotta be a WAN problem caused by OE.

Within an hour OE confirms the problem is of their creation, and they have it fixed. They are very unhappy with the tech who caused it.

Conclusion
Sometimes things are what they appear to be. If you notice I didn’t do much to really help with the issue, and that’s all just about anyone can do when so much is outsourced. They did feel that my trace helped to convince the telecom that this really was their issue.

I guess they were encrypting WAN traffic on one end, but forgot to decrypt it on the other end. One of the strangest things to have on a production network.

Turning on telnet in Windows 7
Did I mention that they tested with telnet on Windows 7? They later explained how to enable it. Go to control panel / Programs / Programs and Features / Turn Windows features on. There is an option for Telnet client. Reboot (yes, you really need to reboot for this to take effect) and you’re good to go.

Categories
Admin Linux Network Technologies Raspberry Pi

Compiling hping on CentOS

Intro
hping was recommend to me as a tool to stage a mock DOS test against one of our servers. I found that I did not have it installed on my CentOS 6 instance and could not find it with a yum search. I’m sure there is an rpm for it somewhere, but I figured it would be just as easy to compile it myself as to find the rpm. I was wrong. It probably was a _little_ harder to compile it, but I learned some things in doing so. So I’ll share my experience. It wasn’t too bad. I have nothing original to add here to what you find elsewhere, except that I didn’t find anywhere else with all these problems documented in one place. So I’ve produced this blog post as a convenient reference.

I’ve also faced this same situation on SLES – can’t find a package for hping anywhere – and found the same recipe below works to compile hping3.

The Details
I downloaded the source, hping3-20051105.tar.gz, from hping.org. Try a ./configure and…

error can not find the byte order for this architecture, fix bytesex.h

After a few quick searches I began to wonder what the byte order is in the Amazon cloud. Inspired I wrote this C program to find out and remove all doubt:

/* returns true if system is big_endian. See http://unixpapa.com/incnote/byteorder.html - DrJ */
#include<stdio.h>
 
main()
{
    printf("Hello World");
    int ans = am_big_endian();
    printf("am_big_endian value: %d",ans);
 
}
 
int am_big_endian()
  {
     long one= 1;
     return !(*((char *)(&one)));
  }

This program makes me realize a) how much I dislike C, and b) how I will never be a C expert no matter how much I dabble.

The program returns 0 so the Amazon cloud has little endian byte order as we could have guessed. All Intel i386 family chips are little endian it seems. Back to bytesex.h. I edited it so that it has:

#define BYTE_ORDER_LITTLE_ENDIAN
/* # error can not find the byte order for this architecture, fix bytesex.h */

Now I can run make. Next error:

pcap.h No such file or directory.

I installed libpcap-devel with yum to provide that header file:

$ yum install libpcap-devel

Next error:

net/bpf.h no such file or directory

For this I did:

$ ln -s /usr/include/pcap-bpf.h /usr/include/net/bpf.h

TCL
Next error:

/usr/bin/ld: cannot find -ltcl

I decided that I wouldn’t need TCL anyways to run in simple command-line fashion, so I excised it:

./configure --no-tcl

Then, finally, it compiled OK with some warnings.

hping3 for Raspberry Pi
On the Raspberry Pi it was simple to install hping3:

$ sudo apt-get install hping3

That’s it!

Raspberry Pi’s are pretty slow to generate serious traffic, but if you have a bunch I suppose they could amount to something in total.

Conclusion
Now I’m ready to go to use hping3 for a SYN_FLOOD simulated attack or whatever else we want to test.

Categories
Admin Network Technologies

The IT Detective Agency: virus updates are failing

Intro
This case hardly qualifies as worthy subject matter for the IT Detective Agency – it’s pretty run-of-the-mill stuff. But I wanted to document it for completeness and show how a problem in one thing can turn out to have an unexpected cause (At least to me. In hindsight it’s dead obvious what the issue was likely to have been).

The Situation
We have lots of servers at drjohns. So when one of our admins, Shake, said that one of them, nfuz01, can’t reach the Etrust serverto get its virus updates I had no recollection of what that server is or does. Shake asked if the firewall had changed recently. That’s sort of a tricky question because there are always minor changes being done. Most have absolutely no effect because they are additional rules providing new permissions. So I bravely answered No, it hadn’t. And I wondered what he meant in using the word “reach” anyways.

So I walk up to Shake’s desk to get a better idea. He said not only are updates not virus signature updates not occurring, but neither server can PING the other, neither by name nor by IP address. Now we’re getting somewhere. I still haven’t registered where nfuz01 is, but I know the firewall as I’ve set it up permits ICMP traffic to transit. I suggested that maybe nfuz01 had some missing or messed-up routes. Then I went back to my desk to think some more. That’s what gets me motivated – when I’ve publicly speculated about the root cause of something. It’s not so much that I may be proven wrong, but if I am wrong, I want to be the first to find out and issue a correction.

So I tried a PING from my desktop:

C:\>ping 10.91.12.14
 
Pinging 10.91.12.14 with 32 bytes of data:
Reply from 171.18.252.10: TTL expired in transit.
Reply from 171.18.252.10: TTL expired in transit.

I look up where nfuz01 is. It is in a secondary data center. I ping it from a server in that same data center, but one a different segment – it works fine! I ping it from a Linux server in my main data center – totally different results:

> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
From 171.18.252.10 icmp_seq=1 Time to live exceeded
From 171.18.252.10 icmp_seq=2 Time to live exceeded
 
--- 10.91.12.14 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
 
> traceroute -n 10.91.12.14
traceroute to 10.91.12.14 (10.91.12.14), 30 hops max, 40 byte packets
 1  10.136.188.2  1.100 ms  1.523 ms  2.010 ms
 2  10.1.4.2  0.934 ms  0.941 ms  0.773 ms
 3  10.1.4.141  0.869 ms  0.926 ms  0.913 ms
 4  171.18.252.10  1.076 ms  1.096 ms  1.130 ms
 5  10.1.4.129  1.043 ms  1.029 ms  1.018 ms
 6  10.1.4.141  0.993 ms  0.611 ms  0.918 ms
 7  171.18.252.10  0.932 ms  0.916 ms  1.002 ms
 8  10.1.4.129  0.987 ms  1.089 ms  1.121 ms
 9  10.1.4.141  1.152 ms  1.246 ms  1.229 ms
10  171.18.252.10  2.040 ms  2.747 ms  2.735 ms
11  10.1.4.129  1.332 ms  1.418 ms  1.467 ms
12  10.1.4.141  1.477 ms  1.754 ms  1.685 ms
13  171.18.252.10  1.993 ms  1.978 ms  2.013 ms
14  10.1.4.129  1.930 ms  1.960 ms  2.039 ms
15  10.1.4.141  2.065 ms  2.156 ms  2.140 ms
16  171.18.252.10  2.116 ms  5.454 ms  5.453 ms
17  10.1.4.129  4.466 ms  4.385 ms  4.296 ms
18  10.1.4.141  4.266 ms  4.267 ms  4.260 ms
19  171.18.252.10  4.232 ms  4.216 ms  4.216 ms
20  10.1.4.129  4.182 ms  4.063 ms  4.009 ms
21  10.1.4.141  3.994 ms  3.987 ms  2.398 ms
22  171.18.252.10  2.400 ms  2.484 ms  2.690 ms
23  10.1.4.129  2.346 ms  2.449 ms  2.544 ms
24  10.1.4.141  2.534 ms  2.607 ms  2.610 ms
25  171.18.252.10  2.602 ms  2.742 ms  2.736 ms
26  10.1.4.129  2.776 ms  2.856 ms  2.848 ms
27  10.1.4.141  2.648 ms  3.185 ms  3.291 ms
28  171.18.252.10  3.236 ms  3.223 ms  3.235 ms
29  10.1.4.129  3.219 ms  3.277 ms  3.377 ms
30  10.1.4.141  3.363 ms  3.381 ms  3.449 ms

Cool, right? We’ve caught a network loop in the act. Now I know it isn’t the firewall, it isn’t the routes on nfuz01 but it is something with networking. So I sent that off to them….

In less than an hour I got the explanation as well as the fix:


All should be reachable again. There’s a loop I can’t clear amongst some [telecom-owned] routers in the main data center. I’ve superseded it with two /27s until they clear it.

And it pings fine now:

> ping 10.91.12.14
PING 10.91.12.14 (10.91.12.14) 56(84) bytes of data.
64 bytes from 10.91.12.14: icmp_seq=1 ttl=125 time=46.2 ms
64 bytes from 10.91.12.14: icmp_seq=2 ttl=125 time=40.1 ms
64 bytes from 10.91.12.14: icmp_seq=3 ttl=125 time=23.1 ms
 
--- 10.91.12.14 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 23.114/36.502/46.275/9.796 ms

And Shake says the updates came in.

Case closed!

Conclusion
Why wasn’t the problem more obvious to us from the very beginning? Well, if the admin who said nfuz01 couldn’t reach Etrust had tried to log in nfuz01 through the normal Remote Desktop mechanism – and of course failed – then we might have drilled down into a networking cause more quickly. But nfuz01 is a VM and he must have been logged on via VMWare Virtual Center and so he hadn’t noticed that basically the server couldn’t reach anywhere in our main data center. It is also an obscure server (remember that I had no recall about it?) so no one really noticed that it was effectivley out-of-business.

Categories
Admin IT Operational Excellence Network Technologies

The IT Detective Agency: the case of the Adobe form network issue

Intro
Sometimes IT is called in to fix things we know little or nothing about. We may fix it, still not know anything, except what we did to fix it, and move on. Such is the case here with a mysterious Adobe Form that wasn’t working when I was called in to a meeting to discuss it.

The Case
One of our developers created a simple Adobe Acrobat form document. It has some logic to ask for a username and password and then verify them against an LDAP server using a network connection. Or at least that was the theory. It worked fine and then we got new PCs running Windows 7 and it stopped working. They asked me for help.

I asked to get a copy of the form because I like to test on my desktop where I am free to try dumb things and not waste others’ time. Initially I thought I also had the error. They showed me how to turn on Javascript debugging in edit|preferences. The debug window popped up with something like this:

debug 5 : function setConstants
debug 5 : data.gURL = https://b2bqual.drjohnstechtalk.com/invoke/Drjohns_Services.utilities:httpPostToBackEnd
debug 5 : function today_ymd
debug 5 : mrp::initialize: version 0.0001 debug level = 5
debug 5 : Login clicked
debug 5 : calling LDAPQ
debug 5 : in LDAPQ
 
NotAllowedError: Security settings prevent access to this property or method.
SOAP.request:155:XFA:data[0]:mrp[0]:sub1[0]:btnLogin[0]:clic

But this wasn’t the real problem. In this form you get a yellow bar at the top along with this message and you give approval to the form to access what it needs. Then you run it again.

For me, then, it worked. I knew this because it auto-populated some user information fields after taking a few seconds.

So i worked with a couple people for whom it wasn’t working. One had Automatically detect proxy settings checked. Unfortunately the new PCs came this way and it’s really not what we want. We prefer to provide a specific PAC file. With the auto-detect unchecked it worked for this guy.

The next guy said he already had that unchecked. I looked at his settings and confirmed that was the case. However, in his case he mentioned that Firefox is his default browser. He decided to change it back to Internet Explorer. Then he tested and lo and behold. It began to work for him as well!

When it wasn’t working he was seeing an error:

NetworkError: A network connection could not be created.

Later he realized that in Firefox he also was using auto-detect for the proxy settings. When he switched that to Use System Settings all was OK and he could have FF as default browser and get this form to work.

Conclusion
This is speculation on my part. I guess that our new version of Acrobat Reader X, v 10.1.1, is not competent in interpreting the auto-detect proxy setting, and that it is also tripped up by the proxy settings in Firefox.

There’s a lot more I’d like to understand about what was going on here, but sometimes speed counts. The next problem is already calling my name…

Categories
Admin IT Operational Excellence Network Technologies

The IT Detective Agency: ARP Entry OK, PING not Working

Intro
Yes, the It detective agency is back by popular demand. This time we’ve got ourselves a thriller involving a piece of equipment – a wireless LAN controller, WLAN – on a directly connected network. From the router we could see the arp entry for the WLAN, but we could not PING it. Why?

A trace, or more correctly the output of tcpdump run on the router interface connected to that network, showed this:

>
12:08:59.623509  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:08:59.623530  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:01.272922  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:03.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:05.271469  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:07.271885  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.271804  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:09.622902  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:09.622922  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:11.271567  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:13.271716  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:15.271971  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:17.040748  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:17.271663  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.271832  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:19.392578  I b8:c7:5d:19:b9:9e (oui Unknown) > Broadcast Null Unnumbered, xid, Flags [Command], length 46
12:09:19.623515  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:19.623535  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:20.478397  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:21.271714  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:23.271697  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:25.271664  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:27.272156  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.271730  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:29.621882  I arp who-has rtr7687.drjohnhilgarts.com tell wlan.drjohnhilgarts.com
12:09:29.621903  O arp reply rtr7687.drjohnhilgarts.com is-at 01:a1:00:74:55:12 (oui Nokia Internet Communications)
12:09:31.271765  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43
12:09:33.271858  I STP 802.1d, Config, Flags [none], bridge-id 2332.3c:df:1e:8f:2b:c0.8312, length 43

What’s interesting is what isn’t present. No PINGs. No unicast traffic whatsoever, yet we knew the WLAN was generating traffic. The frequent arp requests for the same IP strongly hinted that the WLAN was not getting the response. We were not able to check the arp table of the WLAN. And we knew the WLAN was supposed to respond to our PINGs, but it wasn’t. Yet the router’s arp table had the correct entry for the WLAN, so we knew it was plugged into the right switch port and on the right vlan. We also triple-checked that the network masks matched on both devices. Let’s go back. Was it really on the right vlan??

The Solution
What we eventually realized is that in the WLAN GUI, VLANs were assigned to the various interfaces. the switch port, on a Cisco switch, was a regular access port. We reasoned (documentation was scarce) that the interface was vlan tagging its traffic. So we tried to change the access port to a trunk port and enter the correct vlan. Here’s the show conf snippet:

interface GigabitEthernet1/17
 description 5508-wlan
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 887
 switchport mode trunk
 spanning-tree portfast edge trunk

Bingo! With that in place we could ping the WLAN and it could send us its traffic.

Case closed.

2018 update
I had totally forgotten my own posting. And I’ll be damned if in the heat of connecting a new firewall to a switch port we didn’t have this weird situation where we could see MAC entries of the firewall, and it could see MACs of other devices on that vlan, but nobody could ping the firewall and vica versa. A trace from tcpdump looked roughly similar to the above – a lot of arp who-has firewall, tell server. Sure enough, the firewall guy, new to the group, had configured all his ports to be tagged ports, even those with a single vlan. It had been our custom to make single vlans non-tagged ports. I didn’t start it, that’s just how it was. More than an hour was lost debugging…

And earlier in the year was yet another similar incident, where a router operated by a vendor joining one of our vlans assumed tagged ports where we did not. More than an hour was lost debugging… See a pattern there?

I had forgotten my own post from seven years ago to such an extent, I was just about to write a new one when I thought, Maybe I’ve covered that before. So old topics are new once again… Here’s to remember this for the next time!

Where to watch out for this
When you don’t run all the equipment. If you ran it all you’d have the presence of mind to make all the ports consistent.

Some terminology
A tagged port can also be known as using 802.1q, which is also known as dot1q, which in Cisco world is known as a trunk port. In the absence of that, you would have an access port (Cisco terminology) or untagged port (everyone else).

Conclusion
OK, there are probably many reasons and scenarios in which devices on the same network can see each other’s arp entries, but not send unicast traffic. But, the scenario we have laid out above definitely produces that effect, so keep it in mind as a possibility should you ever encounter this issue.

Categories
Admin IT Operational Excellence Network Technologies

Internet Service Providers Block TCP Port 22 or Do They?

Intro
The original premise of this article is that some Internet Service Providers were seen to block TCP port 22, used by ssh and sftp. However, as often happens during active IT investigations, this turns out to be completely wrong. In fact there was a block in this case we studied, but not by the ISPs. An overly aggressive ACL on the customer premise equipment Internet router is in fact the culprit.

The Problem
(IPs skewed to protect whatever) We asked a partner to do an sftp to drjohnstechtalk.com. All firewall and routing rules were in place. The partner tried it. He saw a SYN packet leaving, but no packets being returned. Here at drjohnstechtalk, we didn’t see any packets whatsoever! This partner makes sftp connections to other servers successfully. What the heck?

We had them try the following basic command:

nc -v host 22

where host is the IP of the target server. The response was:

nc: connect to host port 22 (tcp) failed: No route to host

But switching to port 21 (FTP) showed completely different behaviour: there was no message whatsoever and the session hanged. That’s good! That’s the usual firewalls dropping packets. But this No route to host needs more exploration.

Getting Closer
So we did an open trace. I mean a tcpdump without any limiting expression. The dump showed the SYN out to port 22, followed by this nugget:

13:09:14.279176 IP Sprint_IP > src_IP: ICMP host target_IP unreachable - admin prohibited filter, length 36

Next Steps
This well-intentioned filtering is causing a business problem. The Cisco IOS ACL that got them into trouble was this one:

ip access-list extended drop-spoof-and-telnet
 deny   tcp any any eq 22 log-input

Solution
They liked the idea of this filtering, but apparently this was the first request for inbound ssh access. So they decided to keep this filter rule but precede it with more specific rules as required, essentially acting like a second firewall:

ip access-list extended drop-spoof-and-telnet
 permit tcp host IP_src host IP_dest eq 22
  deny   tcp any any eq 22 log-input